
POWER AWARE DESIGN
METHODOLOGIES

edited by

Massoud Pedram
University of Southern California

and

Jan M. Rabaey
University of California, Berkeley

KLUWER ACADEMIC PUBLISHERS


NEW YORK, BOSTON, DORDRECHT, LONDON, MOSCOW
eBook ISBN: 0-306-48139-1
Print ISBN: 1-4020-7152-3

©2002 Kluwer Academic Publishers, New York, Boston, Dordrecht, London, Moscow

Print ©2002 Kluwer Academic Publishers, Dordrecht

All rights reserved

No part of this eBook may be reproduced or transmitted in any form or by any means, electronic,
mechanical, recording, or otherwise, without written consent from the Publisher

Created in the United States of America

Visit Kluwer Online at: http://kluweronline.com


and Kluwer's eBookstore at: http://ebooks.kluweronline.com
Contents

CONTRIBUTORS xvii

PREFACE xix

1. INTRODUCTION 1
MASSOUD PEDRAM AND JAN RABAEY
1.1 INTRODUCTION 1
1.2 SOURCES OF POWER CONSUMPTION 2
1.3 LOW-POWER VERSUS POWER-AWARE DESIGN 2
1.4 POWER REDUCTION MECHANISMS IN CMOS CIRCUITS 3
1.5 POWER REDUCTION TECHNIQUES IN MICROELECTRONIC
SYSTEMS 4
1.6 BOOK ORGANIZATION AND OVERVIEW 5
1.7 SUMMARY 7
2. CMOS DEVICE TECHNOLOGY TRENDS FOR POWER-
CONSTRAINED APPLICATIONS 9
DAVID J. FRANK
2.1 INTRODUCTION 9
2.2 CMOS TECHNOLOGY SUMMARY 11
2.2.1 Current CMOS Device Technology 11
2.2.2 ITRS Projections 13
2.3 SCALING PRINCIPLES AND DIFFICULTIES 15
2.3.1 General Scaling 16
2.3.2 Characteristic Scale Length 18
2.3.3 Limits to Scaling 20
2.3.3.1 Tunnelling Through the Gate Insulator 21
2.3.3.2 Junction Tunnelling 23
2.3.3.3 Discrete Doping Effects 24
2.3.3.4 Thermodynamic Effects 26
2.4 POWER-CONSTRAINED SCALING LIMITS 29
2.4.1 Optimizing Vdd and Vt 29
2.4.2 Optimizing Gate Insulator Thickness and Gate Length - the
Optimal End to Scaling 30
2.4.3 Discussion of the Optimizations 33
2.5 EXPLORATORY TECHNOLOGY 35
2.5.1 Body- or Back-Gate Bias 35
2.5.2 Strained Si 36
2.5.3 Fully-Depleted SOI 38
2.5.4 Double-gate FET Structures 40
2.5.5 Low Temperature Operation for High Performance 44
2.6 SUMMARY 45
3. LOW POWER MEMORY DESIGN 51
YUKIHITO OOWAKI AND TOHRU TANZAWA
3.1 INTRODUCTION 51
3.2 FLASH MEMORIES 52
3.2.1 Flash Memory Cell Operation and Control Schemes 55
3.2.1.1 NOR Flash Memory 55
3.2.1.2 NAND Flash Memory 59
3.2.2 Circuits Used in Flash Memories 63
3.2.2.1 Charge Pump Circuits 64
3.2.2.2 Level Shifter 68
3.2.2.3 Sense Amplifier 71
3.2.2.4 Effect of the Supply Voltage Reduction on Power 73
3.3 FERROELECTRIC MEMORY 74
3.3.1 Basic Operation of FeRAM 74
3.3.2 Low Voltage FeRAM Design 77
3.3.2.1 Optimization of Bit-line Capacitance 77
3.3.2.2 Cell Plate Line Drive Techniques 77
3.3.2.3 Non-driven Cell Plate Line Scheme 79
3.3.2.4 Other Low Voltage Techniques 81
3.4 EMBEDDED DRAM 82
3.4.1 Advantages of Embedded DRAM 82
3.4.2 Low Voltage Embedded DRAM Design 83
3.5 SUMMARY 85
4. LOW-POWER DIGITAL CIRCUIT DESIGN 91
TADAHIRO KURODA
4.1 INTRODUCTION 91
4.2 LOW VOLTAGE TECHNOLOGIES 92
4.2.1 Variable Vdd and Vth 93
4.2.2 Dual ’s 96
4.2.3 Multiple Vdd's and Vth's 98
4.2.3.1 Multiple Power Supplies 98
4.2.3.2 Multiple Threshold Voltages 101
4.2.3.3 Multiple Transistor Width 104
4.2.3.4 Summary 106
4.2.4 Low Voltage SRAM 106
4.3 LOW SWITCHING-ACTIVITY TECHNIQUES 110
4.4 LOW CAPACITANCE TECHNOLOGIES 118
4.5 SUMMARY 118
5. LOW VOLTAGE ANALOG DESIGN 121
K. UYTTENHOVE AND M. STEYAERT
5.1 INTRODUCTION 122
5.1.1 Fundamental Limits to Low Power Consumption 122
5.1.2 Practical Limitations for Achieving the Minimum Power
Consumption 123
5.1.3 Implications of Reduced Supply Voltages 124
5.2 SPEED-POWER-ACCURACY TRADE-OFF IN HIGH SPEED
ADC’s 126
5.2.1 High-speed ADC Architecture 126
5.2.2 Models for Matching in Deep-submicron Technologies 129
5.2.2.1 What is Transistor Mismatch? 129
5.2.2.2 Transistor Mismatch Modelling 130
5.2.2.3 Speed-power-accuracy Trade-off 134
5.3 IMPACT OF VOLTAGE SCALING ON TRADE-OFF IN HIGH-SPEED
ADC’s 136
5.3.1 Slew Rate Dominated Circuits vs. Settling Time Dominated
Circuits 143
5.4 SOLUTIONS FOR LOW VOLTAGE ADC DESIGN 145
5.4.1 Technological Modifications 145
5.4.2 System Level 146
5.4.3 Architectural Level 146
5.5 COMPARISON WITH PUBLISHED ADC’s 147
5.6 SUMMARY 148
6. LOW POWER FLIP-FLOP AND CLOCK NETWORK DESIGN METHODOLOGIES IN HIGH-PERFORMANCE SYSTEM-ON-A-CHIP 151
CHULWOO KIM AND SUNG-MO (STEVE) KANG
6.1 INTRODUCTION 151
6.1.1 Power Consumption in VLSI Chips 151
6.1.2 Power Consumption of Clocking System in VLSI Chips 152
6.2 HIGH-PERFORMANCE FLIP-FLOPS 155
6.3 LOW-POWER FLIP-FLOPS 156
6.3.1 Master-Slave Latch Pairs 157
6.3.2 Statistical Power Reduction Flip-Flops 158
6.3.3 Small-Swing Flip-Flops 160
6.3.4 Double-Edge Triggered Flip-Flops 163
6.3.5 Low-Swing Clock Double-Edge Triggered Flip-Flop 166
6.3.6 Comparisons of Simulation Results 170
6.4 MORE ON CLOCKING POWER-SAVING METHODOLOGIES 171
6.4.1 Clock Gating 172
6.4.2 Embedded Logic in Flip-Flops 173
6.4.3 Clock Buffer (Repeater) and Tree Design 173
6.4.4 Potential Issues in Multi-GHz SoCs in VDSM Technology 174
6.5 COMPARISON OF POWER-SAVING APPROACHES 174
6.6 SUMMARY 176
7. POWER OPTIMIZATION BY DATAPATH WIDTH
ADJUSTMENT 181
HIROTO YASUURA AND HIROYUKI TOMIYAMA
7.1 INTRODUCTION 181
7.2 POWER CONSUMPTION AND DATAPATH WIDTH 183
7.2.1 Datapath Width and Area 183
7.2.2 Energy Consumption and Datapath Width 185
7.2.3 Dynamic Adjustment of Datapath Width 186
7.3 BIT-WIDTH ANALYSIS 187
7.4 DATAPATH WIDTH ADJUSTMENT ON A SOFT-CORE
PROCESSOR 188
7.5 CASE STUDIES 193
7.5.1 ADPCM Decoder LSI 193
7.5.2 MPEG-2 AAC Decoder 194
7.5.3 MPEG-2 Video Decoder Processors 195
7.6 QUALITY-DRIVEN DESIGN 196
7.7 SUMMARY 198
8. ENERGY-EFFICIENT DESIGN OF HIGH-SPEED LINKS 201
GU-YEON WEI, MARK HOROWITZ, AND AEKA KIM
8.1 INTRODUCTION 201
8.2 OVERVIEW OF LINK DESIGN 203
8.2.1 Figures of Merit 204
8.2.2 Transmitter 206
8.2.2.1 High-impedance Drivers 206
8.2.2.2 Single-ended vs. Differential 207
8.2.2.3 Slew-rate Control 209
8.2.3 Receiver 210
8.2.4 Clock Synthesis and Timing Recovery 212
8.2.5 Putting It Together 214
8.3 APPROACHES FOR ENERGY EFFICIENCY 215
8.3.1 Parallelism 215
8.3.1.1 Sub-clock Period Symbols 216
8.3.1.2 Pulse-amplitude Modulation 218
8.3.2 Adaptive Power-Supply Regulation 218
8.3.3 Putting It Together 222
8.4 EXAMPLES 222
8.4.1 Supply-Regulated PLL and DLL Design 223
8.4.1.1 DLL 223
8.4.1.2 PLL design 226
8.4.2 Adaptive-Supply Serial Links 228
8.4.2.1 Multi-phase Clock Generation 229
8.4.2.2 Low-voltage Transmitter and Receiver 230
8.4.2.3 Clock-recovery PLL 231
8.4.3 Low-Power Area-Efficient Hi-Speed I/O Circuit
Techniques 232
8.4.3.1 Transmitter 234
8.4.3.2 Receiver 234
8.4.4 Putting It Together 235
8.5 SUMMARY 236
9. SYSTEM AND MICROARCHITECTURAL LEVEL POWER
MODELING, OPTIMIZATION, AND THEIR IMPLICATIONS
IN ENERGY AWARE COMPUTING 241
DIANA MARCULESCU AND RADU MARCULESCU
9.1 INTRODUCTION 241
9.2 SYSTEM-LEVEL MODELING AND DESIGN EXPLORATION 242
9.3 THE SAN MODELING PARADIGM 245
9.3.1 The SAN Model Construction 246
9.3.2 Performance Model Evaluation 248
9.4 CASE STUDY: POWER-PERFORMANCE OF THE MPEG-2 VIDEO DECODER APPLICATION 249
9.4.1 System Specification 249
9.4.2 Application Modeling 250
9.4.3 Platform Modeling 251
9.4.4 Mapping 252
9.5 RESULTS AND DISCUSSION 253
9.5.1 Performance Results 254
9.5.2 Power Results 256
9.6 MICROARCHITECTURE-LEVEL POWER MODELING 257
9.7 EFFICIENT PROCESSOR DESIGN EXPLORATION FOR LOW
POWER 261
9.7.1 Efficient Microarchitectural Power Simulation 262
9.7.2 Design Exploration Trade-offs 265
9.8 IMPLICATIONS OF APPLICATION PROFILE ON ENERGY-AWARE
COMPUTING 268
9.8.1 On-the-fly Energy Optimal Configuration Detection and
Optimization 269
9.8.2 Energy Profiling in Hardware 269
9.8.3 On-the-fly Optimization of the Processor Configuration 270
9.8.4 Selective Dynamic Voltage Scaling 270
9.8.5 Effectiveness of Microarchitecture Resource Scaling 271
9.8.6 Comparison with Static Throttling Methods 272
9.9 SUMMARY 273
10. TOOLS AND TECHNIQUES FOR INTEGRATED
HARDWARE-SOFTWARE ENERGY OPTIMIZATIONS 277
N. VIJAYKRISHNAN, M. KANDEMIR, A. SIVASUBRAMANIAM, AND M. J. IRWIN
10.1 INTRODUCTION 277
10.2 POWER MODELING 279
10.3 DESIGN OF SIMULATORS 281
10.3.1 A SimOS-Based Energy Simulator 282
10.3.2 Trimaran-based VLIW Energy Simulator 284
10.4 HARDWARE-SOFTWARE OPTIMIZATIONS: CASE STUDIES 286
10.4.1 Studying the Impact of Kernel and Peripheral Energy
Consumption 286
10.4.2 Studying the Impact of Compiler Optimizations 289
10.4.2.1 Superblock 289
10.4.2.2 Hyperblock 289
10.4.3 Studying the Impact of Architecture Optimizations 291
10.5 SUMMARY 292
11. POWER-AWARE COMMUNICATION SYSTEMS 297
MANI SRIVASTAVA
11.1 INTRODUCTION 298
11.2 WHERE DOES THE ENERGY GO IN WIRELESS
COMMUNICATIONS 299
11.2.1 Electronic and RF Energy Consumption in Radios 299
11.2.2 First-order Energy Model for Wireless Communication 302
11.2.3 Power consumption in Short-range Radios 302
11.3 POWER REDUCTION AND MANAGEMENT FOR WIRELESS
COMMUNICATIONS 304
11.4 LOWER LAYER TECHNIQUES 305
11.4.1 Dynamic Power Management of Radios 305
11.4.1.1 The Energy-speed Control Knobs 306
11.4.1.2 Exploiting the Radio-level Energy-speed Control
knobs in the Energy-aware Packet Scheduling 310
11.4.2 More Lower-layer Energy-speed Control Knobs 315
11.4.2.1 Frame Length Adaptation 316
11.4.3 Energy-aware Medium Access Control 317
11.5 HIGHER LAYER TECHNIQUES 319
11.5.1 Network Topology Management 319
11.5.1.1 Topology Management via Energy vs. Density Trade-
off 321
11.5.1.2 Topology Management via Energy vs. Set-up Latency
Trade-off 324
11.5.1.3 Hybrid Approach 329
11.5.2 Energy-aware Data Routing 329
11.6 SUMMARY 332
12. POWER-AWARE WIRELESS MICROSENSOR NETWORKS
335
REX MIN, SEONG-HWAN CHO, MANISH BHARDWAJ, EUGENE SHIH,
ALICE WANG, ANANTHA CHANDRAKASAN
12.1 INTRODUCTION 335
12.2 NODE ENERGY CONSUMPTION CHARACTERISTICS 338
12.2.1 Hardware Architecture 338
12.2.2 Digital Processing Energy 339
12.2.3 Radio Transceiver Energy 341
12.3 POWER AWARENESS THROUGH ENERGY SCALABILITY 343
12.3.1 Dynamic Voltage Scaling 343
12.3.2 Ensembles of Systems 345
12.3.3 Variable Radio Modulation 346
12.3.4 Adaptive Forward Error Correction 349
12.4 POWER-AWARE COMMUNICATION 355
12.4.1 Low-Power Media Access Control Protocol 355
12.4.2 Minimum Energy Multihop Forwarding 359
12.4.3 Clustering and Aggregation 361
12.4.4 Distributed Processing through System Partitioning 363
12.5 NODE PROTOTYPING 365
12.5.1 Hardware Architecture 366
12.5.2 Measured Energy Consumption 369
12.6 FUTURE DIRECTIONS 369
12.7 SUMMARY 370
13. CIRCUIT AND SYSTEM LEVEL POWER MANAGEMENT
373
FARZAN FALLAH AND MASSOUD PEDRAM
13.1 INTRODUCTION 373
13.2 SYSTEM-LEVEL POWER MANAGEMENT TECHNIQUES 377
13.2.1 Greedy Policy 377
13.2.2 Fixed Time-out Policy 378
13.2.3 Predictive Shut-down Policy 378
13.2.4 Predictive Wake-up Policy 379
13.2.5 Stochastic Methods 379
13.2.5.1 Modeling and Optimization Framework 380
13.2.5.2 A Detailed Example 381
13.2.5.3 Adaptive Power Control 385
13.2.5.4 Battery-aware Power Management 386
13.3 COMPONENT-LEVEL POWER MANAGEMENT TECHNIQUES 386
13.3.1 Dynamic Power Minimization 387
13.3.1.1 Clock Gating 388
13.3.1.2 Dynamic Voltage and Frequency Scaling 392
13.3.1.3 Pre-computation 396
13.3.2 Leakage Power Minimization 398
13.3.2.1 Power Gating 401
13.3.2.2 Body Bias Control 405
13.3.2.3 Minimum Leakage Vector Method 406
13.4 SUMMARY 409
14. TOOLS AND METHODOLOGIES FOR POWER SENSITIVE
DESIGN 413
JERRY FRENKIL
14.1 INTRODUCTION 413
14.2 THE DESIGN AUTOMATION VIEW 414
14.2.1 Power Consumption Components 415
14.2.2 Different Types of Power Tools 417
14.2.3 Power Tool Data Requirements 418
14.2.3.1 Design Data 419
14.2.3.2 Environmental Data 419
14.2.3.3 Technology Data & Power Models 421
14.2.3.4 Modeling Standards 423
14.2.4 Different Types of Power Measurements 426
14.2.4.1 Power Dissipation and Power Consumption 426
14.2.4.2 Instantaneous Power 426
14.2.4.3 RMS Power 427
14.2.4.4 Time Averaged Power 427
14.3 TRANSISTOR LEVEL TOOLS 427
14.3.1 Transistor Level Analysis Tools 428
14.3.2 Transistor Level Optimization Tools 428
14.3.3 Transistor Level Characterization and Modeling Tools 429
14.3.4 Derivative Transistor Level Tools 430
14.4 GATE-LEVEL TOOLS 431
14.4.1 Gate-Level Analysis Tools 432
14.4.2 Gate-Level Optimization Tools 432
14.4.3 Gate-Level Modeling Tools 435
14.4.4 Derivative Gate-Level Tools 436
14.5 REGISTER TRANSFER-LEVEL TOOLS 437
14.5.1 RTL Analysis Tools 438
14.5.2 RTL Optimization Tools 440
14.6 BEHAVIOR-LEVEL TOOLS 440
14.6.1 Behavior-Level Analysis Tools 441
14.6.2 Behavior-Level Optimization Tools 442
14.7 SYSTEM-LEVEL TOOLS 442
14.8 A POWER-SENSITIVE DESIGN METHODOLOGY 443
14.8.1 Power-Sensitive Design 444
14.8.2 Feedback vs. Feed Forward 444
14.9 A VIEW TO THE FUTURE 447
14.10 SUMMARY 447
15. RECONFIGURABLE PROCESSORS — THE ROAD TO
FLEXIBLE POWER-AWARE COMPUTING 451
J. RABAEY, A. ABNOUS, H. ZHANG, M. WAN, V. GEORGE, V.
PRABHU
15.1 INTRODUCTION 451
15.2 PLATFORM-BASED DESIGN 452
15.3 OPPORTUNITIES FOR ENERGY MINIMIZATION 454
15.3.1 Voltage as a Design Variable 455
15.3.2 Eliminating Architectural Waste 455
15.4 PROGRAMMABLE ARCHITECTURES—AN OVERVIEW 456
15.4.1 Architecture Models 457
15.4.2 Homogeneous and Heterogeneous Architectures 460
15.4.3 Agile Computing Systems (Heterogeneous Compute
Systems-on-a-chip) 461
15.5 THE BERKELEY PLEIADES PLATFORM [10] 462
15.5.1 Concept 462
15.5.2 Architecture 463
15.5.3 Communication Network 465
15.5.4 Benchmark Example: The Maia Chip [10] 466
15.6 ARCHITECTURAL INNOVATIONS ENABLE CIRCUIT-
LEVEL OPTIMIZATIONS 469
15.6.1 Dynamic Voltage Scaling 469
15.6.2 Reconfigurable Low-swing Interconnect Network 470
15.7 SUMMARY 471
16. ENERGY-EFFICIENT SYSTEM-LEVEL DESIGN 473
LUCA BENINI AND GIOVANNI DE MICHELI
16.1 INTRODUCTION 473
16.2 SYSTEMS ON CHIPS AND THEIR DESIGN 474
16.3 SOC CASE STUDIES 477
16.3.1 Emotion Engine 477
16.3.2 MPEG4 Core 479
16.3.3 Single-chip Voice Recorder 482
16.4 DESIGN OF MEMORY SYSTEMS 484
16.4.1 On-chip Memory Hierarchy 485
16.4.2 Explorative Techniques 487
16.4.3 Memory Partitioning 488
16.4.4 Extending the Memory Hierarchy 489
16.4.5 Bandwidth Optimization 490
16.5 DESIGN OF INTERCONNECT NETWORKS 491
16.5.1 Signal Transmission on Chip 492
16.5.2 Network Architectures and Control Protocols 493
16.5.3 Energy-efficient Design: Techniques and Examples 494
16.5.3.1 Physical Layer 495
16.5.3.2 Data-link Layer 495
16.5.3.3 Network Layer 497
16.5.3.4 Transport Layer 498
16.6 SOFTWARE 498
16.6.1 System Software 499
16.6.1.1 Dynamic Power Management 500
16.6.1.2 Information-flow Management 502
16.6.2 Application Software 502
16.6.2.1 Software Synthesis 504
16.6.2.2 Software Compilation 508
16.6.2.3 Application Software and Power Management 510
16.7 SUMMARY 510
INDEX 517
Contributors

A. Abnous University of California, Berkeley


L. Benini Università di Bologna, Bologna – Italy
M. Bhardwaj Massachusetts Institute of Technology
A. Chandrakasan Massachusetts Institute of Technology
S. H. Cho Massachusetts Institute of Technology
F. Fallah Fujitsu Labs. of America, Inc
D. J. Frank IBM T. J. Watson Research Center
J. Frenkil Sequence Design, Inc.
V. George University of California,Berkeley
M. Horowitz Stanford University
M. J. Irwin Pennsylvania State University, University Park
S. Kang University of California, Santa Cruz
M. Kandemir Pennsylvania State University, University Park
A. Kim Stanford University
J. Kim Stanford University
C. Kim IBM, Microelectronics Division
T. Kuroda Keio University
R. Marculescu Carnegie Mellon University
D. Marculescu Carnegie Mellon University
G. De Micheli Stanford University
R. Min Massachusetts Institute of Technology
Y. Oowaki Toshiba Corp
M. Pedram University of Southern California
V. Prabhu University of California, Berkeley
J. Rabaey University of California, Berkeley
E. Shih Massachusetts Institute of Technology
M. Srivastava University of California, Los Angeles
M. Steyaert Katholieke Universiteit Leuven, ESAT-MICAS
T. Tanzawa Toshiba Corp
H. Tomiyama Institute of System and Information Technologies
K. Uyttenhove Katholieke Universiteit Leuven, ESAT-MICAS
N. Vijaykrishnan Pennsylvania State University, University Park
M. Wan University of California, Berkeley
Al. Wang Massachusetts Institute of Technology
H. Yasuura System LSI Research Center
G. Wei Harvard University
H. Zhang University of California, Berkeley
Preface

The semiconductor industry has experienced phenomenal growth over the last few decades. During this period of growth, minimum feature sizes have decreased on average by 14% per year between 1980 and 2002,
die areas have increased by 13% per year, and design complexity (as
measured by the number of transistors on a chip) has increased at an annual
growth rate of 50% for DRAMs and 35% for microprocessors. Performance
enhancements have been equally impressive. For example, clock frequencies
for leading-edge microprocessors have increased by more than 30% per year.
The semiconductor industry has maintained its growth by achieving a 25-
30% per-year cost reduction per function over the past 35 years. This
productivity growth in integrated circuits has been made possible through
technological advances in device manufacturing and packaging, circuits and
architectures, and design methodologies and tools.
The semiconductor industry is now at a critical junction where it appears
that an unprecedented number of technical challenges threaten the
continuation of Moore's Law. Three formidable challenges are the
“technology challenge” i.e., 50nm and below lithography, the “power
challenge” i.e., sub-microwatt power dissipation per MIPS concurrently with
thousands of MIPS performance, and the “design productivity challenge”
i.e., improvement in design productivity at a rate of 50% or higher per year.
These technological challenges must be solved in order to be able to
continue the historical trends dictated by Moore’s Law (at least for another
12-15 years).
This book addresses the “power challenge.” It is a sequel to our Low-
Power Design Methodologies book, published by Kluwer Academic
Publishers in 1996. The focus of the present book is, however, on power-awareness in design. The difference between low-power design and power-awareness in design is that whereas low-power design refers to minimizing power with
or without a performance constraint, power-aware design refers to
maximizing some other performance metric subject to a power budget (even
while reducing power dissipation).
The book has been conceived as an effort to bring all aspects of power-
aware design methodologies together in a single document. It covers several
layers of the design hierarchy from technology, circuit, logic, and
architectural levels up to the system layer. It includes discussion of
techniques and methodologies for improving the power efficiency of CMOS
circuits (digital and analog), systems on chip, microelectronic systems,
wirelessly networked systems of computational nodes, and so on. In addition
to providing an in-depth analysis of the sources of power dissipation in VLSI
circuits and systems and the technology and design trends, this book
provides a myriad of state-of-the-art approaches to power optimization and
control.
The different chapters of this book have been written by leading
researchers and experts in their respective areas. Contributions are from both
academia and industry. The contributors have reported the various
technologies, methodologies, and techniques in such a way that they are
understandable and useful to the circuit and system designers, tool
developers, and academic researchers and students.
This book may be used as a textbook for teaching an advanced course on
power-aware design methodologies. When and if combined with the Low-
Power Design Methodologies book, it will provide a comprehensive
description of various power-aware and/or low-power design methodologies
and approaches. Instructors can select various combinations of chapters and
augment them with some of the many references provided at the end of each
chapter to tailor the book to their educational needs.
The authors would like to acknowledge the help of Chang-woo Kang and
Melissa Camacho with the preparation of the final manuscript for the book.
Also thanks to Carl Harris for his cooperation as we pushed back the
deadline for final manuscript submission.
We hope that this book – as was the case with its predecessor – will serve
as a broad, yet thorough, introduction for anyone interested in addressing the
“power challenge” in VLSI circuits and systems. To reiterate, solving this
problem is essential if we are to maintain the technology scaling curve
predicted by Moore’s Law.

Massoud Pedram, Los Angeles CA


Jan Rabaey, Berkeley CA
Chapter 1

Introduction

Massoud Pedram, University of Southern California
Jan Rabaey, University of California, Berkeley

Abstract: This chapter provides the motivations for power-aware design, reviews main
sources of power dissipation in CMOS VLSI circuits, hints at a number of
circuit and system-level techniques to improve the power efficiency of the
design, and finally provides an overview of the book content. The chapter
concludes with a list of key challenges for designing low-power circuits or
achieving high power efficiency in designs.

Key words: Low-power design, power-aware design, low-power circuit techniques, energy
efficiency, CMOS devices, Moore’s Law, technology scaling, static power,
dynamic power, voltage scaling, power management, reconfigurable
processors, design methodologies, design tools.

1.1 INTRODUCTION

A dichotomy exists in the design of modern microelectronic systems: they must be simultaneously low power and high performance. This dichotomy
largely arises from the use of these systems in battery-operated portable
(wearable) platforms. Accordingly, the goal of low-power design for battery-
powered electronics is to extend the battery service life while meeting
performance requirements. Unless optimizations are applied at different
levels, the capabilities of future portable systems will be severely limited by
the weight of the batteries required for an acceptable duration of service. In
fixed, power-rich platforms, the packaging cost and power density/reliability
issues associated with high power and high performance systems also force
designers to look for ways to reduce power consumption. Thus, reducing
power dissipation is a design goal even for non-portable devices since
excessive power dissipation results in increased packaging and cooling costs
as well as potential reliability problems. Ldi/dt noise concerns have also
become an important factor that demands low-power consumption in high-
performance integrated circuits. Therefore, as power dissipation increases,
the cost of power delivery to the ever-increasing number of transistors on a
chip multiplies rapidly.

1.2 SOURCES OF POWER CONSUMPTION

There are two kinds of power dissipation in synchronous CMOS digital circuitry: dynamic and static. Dynamic power dissipation includes the
capacitive power that is associated with the switching of logic values in the
circuit. This power component is essential to performing useful logic
operations and is proportional to α·C·Vdd·Vswing·f, where C is the total capacitance, Vdd is the supply voltage, Vswing is the voltage swing, f is the clock frequency, and α denotes the expected number of transitions per clock
cycle. Since this power dissipation is in direct proportion to the complexity
of the logic, rate of computation, and switching activity of the circuit nodes,
it can be minimized by performing circuit/architectural optimizations, by
adjusting the circuit speed, and by scaling the supply voltage and/or using
low-voltage-swing signaling techniques. Dynamic power also includes a
short-circuit power that flows directly from the supply to ground during a
transition at the output of a CMOS gate. This power is also wasted, but with
careful design, it can generally be kept below 15% of the capacitive
power consumption.
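As a back-of-the-envelope illustration of the capacitive term, the short sketch below evaluates α·C·Vdd·Vswing·f for a full-swing CMOS block. The capacitance, frequency, and activity figures are hypothetical, chosen only to show the scale involved, not taken from any measurement.

```python
# Hedged sketch: evaluate P_dyn = alpha * C * Vdd * Vswing * f
# for a hypothetical full-swing CMOS block (Vswing == Vdd).
# All numeric values are illustrative assumptions.

def dynamic_power(alpha, c_total, vdd, f, vswing=None):
    """Capacitive switching power in watts."""
    if vswing is None:
        vswing = vdd          # full-swing signaling
    return alpha * c_total * vdd * vswing * f

# Hypothetical block: 1 nF total switched capacitance,
# 200 MHz clock, 15% of nodes toggling per cycle, 1.8 V supply.
p = dynamic_power(alpha=0.15, c_total=1e-9, vdd=1.8, f=200e6)
print(f"{p * 1e3:.1f} mW")   # 0.15 * 1e-9 * 1.8 * 1.8 * 2e8 = 97.2 mW
```

Note how halving the swing (low-voltage-swing signaling) halves the power linearly, while lowering Vdd itself helps twice when the swing follows the supply.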
Static power is associated with maintaining the logic values of internal
circuit nodes between the switching events. This power dissipation is due to
leakage mechanisms within the device or circuit and does not contribute to
any useful computation. Unfortunately, mechanisms that cause leakage are
becoming worse as CMOS technology scaling proceeds. It is common
knowledge that static power dissipation will play a central role in
determining how long and far Moore’s Law can continue unabated.

1.3 LOW-POWER VERSUS POWER-AWARE DESIGN

Many different metrics have been used to capture the notion of “power and
timing efficiency.” The most commonly used ones are (average) power,
power per MIPS (million instructions per second), energy, energy-delay
product, energy-delay squared, peak power, and so on. The choice of which
design metric to use during the circuit or system optimization is strongly
dependent on the application and target performance specifications. For
example, in battery-powered CMOS circuits, energy is often the correct
metric whereas in high-performance microprocessors the power/MIPS ratio
or the energy-delay product is used.
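A small numerical sketch shows why the choice of metric matters: the same two design points can rank differently under plain energy and under the energy-delay product. The two design points below are invented purely for illustration.

```python
# Hedged sketch: two hypothetical design points for the same task.
# Design A is slow but frugal; design B is fast but hungry.
designs = {
    "A": {"energy_nj": 10.0, "delay_ns": 40.0},
    "B": {"energy_nj": 16.0, "delay_ns": 20.0},
}

for name, d in designs.items():
    edp = d["energy_nj"] * d["delay_ns"]   # energy-delay product, nJ*ns
    print(name, "energy:", d["energy_nj"], "nJ  EDP:", edp)

# Under plain energy, A wins (10 nJ < 16 nJ); under EDP,
# B wins (320 < 400): the right metric depends on the application.
```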
Designing for low power is, however, different than power-aware (or
power-efficient) design. Whereas the former refers to the problem of
minimizing power with or without a performance constraint, the latter refers
to maximizing some other performance metric such as throughput,
bandwidth, or quality of service subject to a power budget (or in some cases
concurrently attempting to reduce power dissipation).
and optimum solutions may thus be quite different for a low-power design
and a power-aware design. For example, in the context of (on-demand)
source-driven routing protocols for mobile ad hoc networks, a low-power
source routing protocol will attempt to find the minimum-power routing solution
between a source node and a destination node subject to a latency (number
of hops) constraint. In contrast, a power-aware source routing protocol will
try to find a routing solution that would maximize the lifetime of the
networked system (i.e., the elapsed time beyond which K% of the nodes in
the network exhaust their energy budget).
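The lifetime criterion just described, the elapsed time until K% of the nodes exhaust their energy budget, can be stated compactly in code. The node budgets and drain rates below are hypothetical values chosen only to make the definition concrete.

```python
# Hedged sketch: networked-system lifetime defined as the time at
# which K% of the nodes have exhausted their energy budget.
# Budgets (joules) and constant drain rates (watts) are made up.

def network_lifetime(budgets, drain_rates, k_percent):
    """Return seconds until k_percent of nodes are depleted."""
    # Each node dies at budget / rate; sort the death times and
    # pick the one at which the K% threshold is first crossed.
    death_times = sorted(b / r for b, r in zip(budgets, drain_rates))
    k_index = -(-len(death_times) * k_percent // 100) - 1  # ceiling - 1
    return death_times[max(0, k_index)]

budgets = [100.0, 100.0, 100.0, 100.0]       # joules per node
rates   = [0.5, 1.0, 2.0, 4.0]               # watts drained per node
print(network_lifetime(budgets, rates, 50))  # time when 2 of 4 nodes die
```

This makes the point in the text visible: a scheme with a low average drain but one heavily loaded node (a high rate outlier) still yields a short K% lifetime.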
Obviously, both low-power design and power-aware design techniques
are needed to address the “power problem.” The emphasis of the current
book, which is a sequel to our Low-Power Design Methodologies book [1],
is on power-awareness in design methodologies, techniques, and tools.

1.4 POWER REDUCTION MECHANISMS IN CMOS CIRCUITS

Aside from technology scaling, reducing only the supply voltage for a given
technology enables significant reduction in power consumption. However,
voltage reduction comes at the expense of slower gate speeds. So, there is a
tradeoff between circuit speed and power consumption. By dynamically
adjusting the supply voltage to the minimum needed to operate at an
operating frequency that meets the computational requirements of the circuit,
one can reduce the power consumption of digital CMOS circuits down to the
minimum required. This technique is referred to as dynamic voltage scaling.
Notice that the rules for analog circuits are quite different than those applied
to digital circuits. Indeed, downscaling the supply voltage does not
automatically decrease analog power consumption.
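A minimal sketch of the dynamic voltage scaling idea, assuming the common alpha-power delay model f_max ∝ (Vdd − Vt)^a / Vdd with invented constants (the text does not specify a model): given a required operating frequency, search for the lowest supply voltage that still meets it.

```python
# Hedged sketch of dynamic voltage scaling: pick the lowest supply
# voltage whose gate speed still meets the required clock frequency.
# Alpha-power model constants are assumptions: Vt = 0.4 V, a = 1.5,
# and K is chosen so that Vdd = 1.8 V gives 500 MHz.

VT, A = 0.4, 1.5
K = 500e6 * 1.8 / (1.8 - VT) ** A

def max_freq(vdd):
    """Highest sustainable clock frequency at supply vdd (model)."""
    return K * (vdd - VT) ** A / vdd

def min_vdd_for(f_req, lo=0.5, hi=1.8, steps=60):
    """Binary-search the lowest Vdd meeting f_req (model is monotone)."""
    for _ in range(steps):
        mid = (lo + hi) / 2
        if max_freq(mid) >= f_req:
            hi = mid
        else:
            lo = mid
    return hi

for f in (100e6, 250e6, 500e6):
    print(f"{f / 1e6:.0f} MHz -> Vdd = {min_vdd_for(f):.2f} V")
```

Because dynamic power grows with Vdd squared, running a light workload at the reduced voltage found here rather than at full supply saves power quadratically, which is the payoff the text describes.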
It has become apparent that the voltage scaling approach is insufficient
by itself. One must also focus on advanced design tools and methodologies
that address the power issues. The list of these issues is lengthy: power grid
sizing and analysis, power-efficient design of the clock distribution network
and flip-flops, datapath width adjustment to minimize the chip logic
complexity and power consumption, effective circuit structures and
techniques for clock gating and power gating, low-power non-volatile
memory technology, design with dynamically-varying supply voltages and
threshold voltages, design with multiple supply voltages and multiple
threshold voltages, and energy-efficient design of high speed links on-chip
and off-chip, to name a few. Complicating designers’ efforts to deal with
these issues are the complexities of contemporary IC designs and the design
flows required to build them. Multi-million gate designs are now common,
incorporating embedded processors, DSP engines, numerous memory
structures, and complex clocking schemes.
Just as with performance, power awareness requires careful design at
several levels of abstraction. The design of a system starts from the
specification of the system functionality and performance requirements and
proceeds through a number of design levels spanning across architectural
design, register transfer level design, and gate level design, finally resulting
in a layout realization. Power efficiency can be obtained at all levels of
design, ranging from low-level circuit and gate optimizations to algorithmic
selection.

1.5 POWER REDUCTION TECHNIQUES IN MICROELECTRONIC SYSTEMS

A typical system consists of a base hardware platform, which executes system and application software and is connected to a number of peripheral
devices. The hardware platform refers to the computational units, the
communication channels, and the memory units. Examples of peripheral
devices include displays, wireless local area networks, and camcorders. The
low power objective is achieved by reducing power dissipation in all parts of
the design. At the same time, choices for the software implementation (in
terms of choosing the operating system kernel and the application-level
code) affect the power efficiency of the base hardware and peripheral
devices. For example, the power dissipation overhead of the operating
system calls, the power-efficiency of the compiled code, and the memory
access patterns play important roles in determining the overall power
dissipation of the embedded system.
Distributed networks of collaborating microsensors are being developed
as a platform for gathering information about the environment. Because a
microsensor node must operate for years on a tiny battery, careful and
innovative techniques are necessary to eliminate energy inefficiencies. The
sensors must operate on a tiny battery for many months and must be able to
communicate wirelessly with each other. They must also be able to increase
their compute power when and if needed (performance on demand) and must
dissipate nearly zero energy during long idle periods. This
scenario poses a number of unique challenges that require power-awareness
at all levels of the communication hierarchy, from the link layer to media
access to routing protocols, as well as power-efficient hardware design and
application software.
Another emerging trend in embedded systems is that they are being
networked to communicate, often wirelessly, with other devices. In such
networked systems, the energy cost of wireless communications often
dominates that of computations. Furthermore, in many networked systems,
the energy-related metric of interest is the lifetime of the entire system, as
opposed to power consumption at individual nodes. A technique that
consumes less average power but results in a high variance in power
consumption where a small number of nodes see a large energy drain is
undesirable. Conventional approaches to power efficiency in computational
nodes (e.g., dynamic power management and dynamic voltage scaling) need
to be extended to work in the context of a networked system of nodes.
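The point about lifetime versus average power can be made concrete with a small sketch. All numbers below are invented for illustration; nothing here comes from the chapter:

```python
# Toy illustration (all numbers invented): why the lifetime of the whole
# network, not average power, is the energy metric of interest.

BATTERY_J = 1000.0  # hypothetical energy budget per node, in joules

def lifetime(node_powers_w):
    """Network lifetime = time (s) until the FIRST node exhausts its battery."""
    return min(BATTERY_J / p for p in node_powers_w)

def average_power(node_powers_w):
    return sum(node_powers_w) / len(node_powers_w)

balanced = [0.010, 0.010, 0.010, 0.010]    # 10 mW at every node
skewed = [0.002, 0.002, 0.002, 0.030]      # lower average, one hot relay node

# The skewed scheme wins on average power, yet the network dies sooner,
# because the heavily loaded node drains first.
assert average_power(skewed) < average_power(balanced)
assert lifetime(skewed) < lifetime(balanced)
```

This is why lifetime-aware routing spreads the communication load rather than simply minimizing total energy.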

1.6 BOOK ORGANIZATION AND OVERVIEW

This book is organized as follows. In Chapter 2, D. J. Frank of IBM T. J.


Watson Research Center describes the present state of CMOS technology
and the scaling principles that drive its progress. The physical effects that
hinder this scaling are also examined to show how these effects interact with
the practical constraints imposed by power dissipation. A brief overview of
some of the novel device options for extending the limits of scaling is also
provided.
In Chapter 3, Y. Oowaki and T. Tanzawa of Toshiba SoC R&D Center
describe the power-aware design of LSI memories. This chapter focuses on
non-volatile flash memories, non-volatile ferroelectric memories, and
embedded DRAMs, which are increasingly important in the
Wireless/Internet era.
In Chapter 4, T. Kuroda of Keio University describes several circuit
techniques for power-aware design, including techniques for a variable
supply voltage, a variable threshold voltage, multiple supply voltages,
multiple threshold voltages, a low-voltage SRAM, a conditional flip-flop,
and an embedded DRAM.
In Chapter 5, K. Uyttenhove and M. Steyaert of Katholieke Universiteit
Leuven, after a general introduction on the limits to low power for analog
circuits, deal with the impact of reduced supply voltage on the power
consumption of high-speed analog to digital converters (ADC). A
comparison with the power consumption of published high-speed analog to
digital converters is also presented.
In Chapter 6, C-W. Kim and S-M. Kang of IBM Microelectronics
Division and the University of California - Santa Cruz describe techniques to
reduce power consumption in both the clock tree and flip-flops. Clock-
gating and logic-embedding techniques are also presented as effective
power-saving techniques, followed by a low-power clock buffer design.
In Chapter 7, H. Yasuura and H. Tomiyama of Kyushu University
introduce several design techniques to reduce the wasteful power
consumption by redundant bits in a datapath. The basic approach is datapath
width adjustment. It is shown that during hardware design, using the result
of bit-width analysis, one can determine the minimal length of registers, the
size of operation units, and the width of memory words on the datapath of a
system in order to eliminate the wasteful power consumption by the
redundant bits.
In Chapter 8, G-Y. Wei, M. Horowitz, and J. Kim of Harvard University
and Stanford University provide a brief overview of high-speed link design
and describe some of the power vs. performance tradeoffs associated with
various design choices. The chapter then investigates various techniques that
a designer may employ in order to reduce power consumption. Three
examples of link designs and link building blocks found in the literature
serve as examples to illustrate energy-efficient implementations of these
techniques.
In Chapter 9, D. Marculescu and R. Marculescu of Carnegie Mellon
University present a design exploration methodology that is meant to
discover the power/performance tradeoffs that are available at both the
system and microarchitectural levels of design abstraction.
In Chapter 10, N. Vijaykrishnan, M. Kandemir, A. Sivasubramaniam,
and M. J. Irwin of Pennsylvania State University describe the design of
energy estimation tools that support both software and architectural
experimentation within a single framework. This chapter presents the details
of two different architectural simulators targeted at superscalar and VLIW
architectures. Finally, techniques that optimize the hardware-software
interaction from an energy perspective are illustrated.
In Chapter 11, M. Srivastava of the University of California - Los
Angeles describes communication-related sources of power consumption
and network-level power-reduction and energy-management techniques in
the context of wirelessly networked systems such as wireless multimedia and
wireless sensor networks. General principles behind power-aware protocols
and resource management techniques at various layers of networked systems
- physical, link, medium access, routing, transport, and application - are presented.
In Chapter 12, R. Min, S-H. Cho, M. Bhardwaj, E. Shih, A. Wang, and
A. Chandrakasan of the Massachusetts Institute of Technology describe
techniques for power-aware wireless microsensor networks design. All
levels of the communication hierarchy, from the link layer to media access to
routing protocols, are discussed with careful attention to the details of energy
consumption at every point in the design process.
In Chapter 13, F. Fallah and M. Pedram of Fujitsu Labs. of America and
University of Southern California present several dynamic power
management and voltage scaling techniques to reduce the dynamic power
dissipation in microelectronic systems. Furthermore, a number of runtime
techniques for reducing the static power dissipation in VLSI circuits are
introduced.
In Chapter 14, J. Frenkil of Sequence Design discusses the various types
of design automation that focus on power aware design. The chapter
includes a survey, by abstraction level, of the different types of power tools
for different applications, including representative examples of
commercially available tools. Following the survey, a power-sensitive
design flow is presented illustrating the use of the tools previously described.
In Chapter 15, J. Rabaey, A. Abnous, H. Zhang, M. Wan, V. George, and
V. Prabhu of the University of California - Berkeley explore the platform-based design approach to systems-on-a-chip and investigate the
opportunity for substantial power reduction by using hybrid reconfigurable
processors. With the aid of an extensive example, it is demonstrated that
power reductions of orders of magnitude are attainable.
In Chapter 16, L. Benini and G. De Micheli of Università di Bologna and
Stanford University survey some of the challenges in achieving energy-
efficient system-level design with specific emphasis on System-on-Chip
implementation.

1.7 SUMMARY

In concluding this introduction, it is worthwhile to summarize the major
challenges in designing low-power circuits and/or achieving high power
efficiency that, in our view, have to be addressed in order to keep up with the
CMOS technology advances as dictated by Moore’s Law.
- Advances in CMOS technology to combat excessive leakage current that comes about with lower threshold voltages
- Effective circuit and system level mechanisms for leakage current minimization and control in VLSI designs
- Designing under very low supply voltages in the presence of large device and interconnect variations
- Circuit design techniques for high-speed and low-power communication links and data encoding and electrical signaling techniques for on-chip and off-chip communication channels
- Low-voltage analog circuit design techniques
- Asynchronous and/or globally asynchronous locally synchronous design styles and tools and methodologies to make such styles feasible and competitive with existing ASIC design flows and tools
- Efficient and accurate design exploration and analysis tools for power and performance
- Power-aware system-level design techniques and tools
- Power-efficient network technologies for future SoCs and micro-network architectural choices and control protocol design
- More effective and robust dynamic system-level power management methodologies and tools that handle arbitrary workload conditions, complex models of the system components, and realistic models of the battery sources
- Power-aware runtime environments for complex embedded systems, low-power embedded OS synthesis, and power-aware network architectures and protocols
- Power-efficient application software design, synthesis, and compilation as well as information flow links between the compilers and the runtime environments
- A framework for computation vs. communication tradeoffs from a power, performance, and quality of service viewpoint
- Energy-efficient reconfigurable circuits and architectures and associated issues related to system-level power optimization, power-aware compilation, and synthesis tools
- Power-aware protocols for mobile ad hoc networks, including application, network, data link, and physical layers

REFERENCES
[1] J. Rabaey and M. Pedram, Low Power Design Methodologies, Kluwer Academic
Publishers, 1996.
Chapter 2
CMOS Device Technology Trends for Power-
Constrained Applications

David J. Frank
IBM T. J. Watson Research Center

Abstract: CMOS device technology has scaled rapidly for nearly three decades and has
come to dominate the electronics world. Because of this scaling, CMOS
circuits have become extremely dense, and power dissipation has become a
major design consideration. Although industry projections call for at least
another 10 years of progress, this progress will be difficult and is likely to be
strongly constrained by power dissipation.

This chapter describes the present state of CMOS technology and the scaling
principles that drive its progress. The physical effects that hinder this scaling
are also examined in order to show how these effects interact with the practical
constraints imposed by power dissipation. It is shown that scaling does not
have a single end. Rather, each application has a different optimal end point
that depends on its power dissipation requirements. A brief overview of some
of the novel device options for extending the limits of scaling is also provided.

Key words: Low power, CMOS, device technology, scaling

2.1 INTRODUCTION

Although the basic concept of the Field Effect Transistor (FET) was
invented in 1930 [1], it was not until 1960 that it was first reduced to
practice in Si by Kahng and Attala [2]. Since then development has
been very rapid. The Si MOSFET has been incorporated into integrated
circuits since the early 1970s, and progress has followed an exponential
behaviour that has come to be known as Moore’s Law [3]. The device
dimensions have been shrinking at an exponential rate, and the circuit
complexity and industry revenues have been growing exponentially.
Since 1994 the semiconductor industry has published projections of these
exponentials into the future to provide technology development targets. The
most recent of these projections is the 2001 International Technology
Roadmap for Semiconductors (ITRS’01) [4]. According to its projections,
the industry hopes to reach production of 64 Gbit DRAM chips by 2016, in
addition to microprocessor FETs with physical gate length below 10 nm and
local clock frequencies up to 29 GHz.
Unfortunately, this rapid scaling of MOSFET technology is also
accompanied by increasing power dissipation. In the end, it may be this
power dissipation that limits further scaling. There are two primary types of
power dissipation in CMOS circuitry: dynamic and static. The dynamic
power is expended usefully, since it is associated with the switching of logic
states, which is central to performing logic operations. Dynamic power is
proportional to C·V_dd²·f, where C is the total capacitance, V_dd is the supply
voltage, and f is the clock frequency. Since this power dissipation is in
direct proportion to the rate of computation, it can be adjusted to meet
application power requirements by adjusting the computation rate. It can
also be adjusted, but to a more limited extent, by adjusting the supply
voltage.
Static power, on the other hand, is associated with the holding or
maintaining of logic states between switching events. This power is due to
leakage mechanisms within the device or circuit, and so is wasted because it
does not contribute to computation. Unfortunately, leakage is unavoidable,
and the mechanisms that cause it are rapidly increasing in severity as scaling
proceeds.
There is also transient short-circuit power that flows directly from supply
to ground during a switching event. This power is wasted, too, but with
careful design, it can generally be kept much smaller than the active power
[5].
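The dynamic and static components described above can be sketched numerically. This is an illustrative model only; the capacitance, leakage, and frequency values are hypothetical, and short-circuit power is omitted as small:

```python
# Illustrative model of the two primary power components described above
# (all device values are hypothetical; short-circuit power omitted as small).

def dynamic_power(c_farads, vdd, freq_hz, activity=1.0):
    """Switching power, proportional to C * Vdd^2 * f."""
    return activity * c_farads * vdd ** 2 * freq_hz

def static_power(vdd, i_leak_amps):
    """Leakage power, dissipated whether or not the circuit switches."""
    return vdd * i_leak_amps

C, F, I_LEAK = 1e-9, 1e9, 1e-3            # 1 nF, 1 GHz, 1 mA (hypothetical)
full = dynamic_power(C, 1.2, F) + static_power(1.2, I_LEAK)
half = dynamic_power(C, 1.2, F / 2) + static_power(1.2, I_LEAK)

# Halving f halves only the dynamic term; leakage is unaffected.
assert half > full / 2
# Lowering Vdd cuts dynamic power quadratically.
assert abs(dynamic_power(C, 0.6, F) - 0.25 * dynamic_power(C, 1.2, F)) < 1e-9
```

The asymmetry in the assertions is the crux of the chapter's argument: throttling the clock trades performance for dynamic power, but only voltage and device design can touch the static term.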
When these mechanisms are considered in conjunction with the power
dissipation requirements of different applications, it is easy to show that
static power plays a central role in determining how far scaling can go for a
given set of technology elements and that there is no single end to scaling.
Rather, there are many different “ends,” each corresponding to an optimized
technology for a specific application [6]. According to this analysis, further
progress, if it does not come from circuit and system innovations, requires
new technology elements. When new technology elements are invented,
developed, and added to the CMOS technology base, they enable new
optimization points, and the “ends of scaling” can advance.
The organization of this chapter is as follows. The next section
summarizes the present state of the art of CMOS device technology and
describes the industry projections for future progress. The third section
details the ideas of CMOS scaling that should enable this progress and then
discusses the physical effects that limit this scaling. The fourth section goes
into the optimization of CMOS technology for power-constrained operation
and uses this analysis to provide an estimate of how the limits of scaling
vary with application. The fifth section highlights exploratory CMOS
technology directions that may enable further scaling, and the final section is
a conclusion.

2.2 CMOS TECHNOLOGY SUMMARY

2.2.1 Current CMOS Device Technology

Several recent review articles have summarized the characteristics of
state-of-the-art CMOS technology; see, for example, [7] and [8]. Figure 2.1
illustrates most of the important features for conventional bulk MOSFET
technology. The gates are made of n- and p-type poly-silicon so that both
nFETs and pFETs are surface channel devices, improving performance over
older technologies in which the pFETs had buried channels. The gates are
topped with a metal silicide, which lowers the gate series resistance. Special
lithographic and etching techniques are used to pattern the gates with
minimum dimensions down to 50% or more below the general lithographic
feature size. The gate dielectric is very thin, approaching 1.0 nm of Si
oxynitride for high performance logic, and 2.5 nm for lower power devices.
The nitrogen content of the oxynitride has been increasing in recent years because this
increases the dielectric constant, enabling effective oxide thicknesses (EOTs)
to be smaller than the physical thickness. Scaling requires these thin gate
insulators in order to adequately limit short channel effects and to maximize
performance. The tunneling leakage current of these thin insulators has
become a major concern for many applications.
Devices are electrically separated by shallow trench isolation (STI),
which involves etching trenches, filling them with deposited oxide and
polishing the oxide to planarize. This process allows devices to be placed
very close together, resulting in high circuit density. The source and drain
consist of very shallow doping extensions under the gate edges and gate
sidewalls, in combination with deeper implants under the contacts. These
are engineered to reduce short channel effects, reduce gate insulator
degradation due to hot electrons and provide low contact resistance to the
silicide.
The precise engineering of the doping profiles in the channel is of great
importance in achieving the shortest possible channel lengths. The retrograde
doping profile (doping that is low at the surface and increases with
depth) reduces the transverse electric field in the channel (improving
mobility), while at the same time reducing two-dimensional effects by
improving the shielding of the drain potential from the channel region.
Shallow angled ion implantation results in super-halo doping profiles near
the source and drain regions that partially cancel 2D-induced threshold
voltage shifts, resulting in less roll-off.

The drawing in Figure 2.1 does not show the wiring levels, but the wires
are clearly essential in creating large integrated circuits, and substantial
technological progress is occurring there too. Today most of the wire is
copper because of its low resistivity and reduced electromigration. The
wire-to-wire capacitance is reduced by the use of fluorinated silicate glass
(FSG) for the insulator, with permittivity k=3.7, and also by taking
advantage of the low resistivity of copper to somewhat reduce the aspect
ratio of the wires [4]. Even lower k materials may be in use soon. It is
common practice to use a hierarchy of wiring sizes, from very fine wires at
minimum lithographic dimension on the bottom to large “fat” wires on the
top, which helps to keep wire delay under control [9].
In addition to bulk CMOS, partially-depleted silicon-on-insulator (PD-
SOI) CMOS is also available [10]. As shown in Figure 2.2, this technology
is similar in many ways to bulk technology, but there are also some
important differences. The main difference is that PD-SOI CMOS is built on
a thin layer of Si, 150 nm thick, on top of an insulating layer. For
partially-depleted SOI, the silicon is thick enough that it is not fully depleted
by the gate bias (20-50 nm, depending on doping). The buried oxide (BOX)
layer is typically 150-250 nm thick and completely insulates the device layer
from the substrate. It is formed in one of two ways: (a) by heavy
implantation of oxygen into a Si substrate followed by high temperature
annealing (SIMOX) or (b) by a wafer bonding and layer transfer technique
[11]. As a result of this insulating layer, the body of the SOI device is
floating and the source- and drain-to-body junction capacitances are
significantly reduced. Both of these effects can increase digital switching
speed, although the detailed advantages depend on circuit configuration [12].
Since the body is floating, the usual bulk MOSFET body-effect
dependencies on source-to-substrate voltage are absent. These are replaced
by floating-body effects such as history-dependent body bias and increased
output conductance (kink-effect) caused by the injection of majority carriers
into the body by impact ionization of channel carriers in the high-field drain
region.

2.2.2 ITRS Projections

Since 1994 the Semiconductor Industry Association has been creating
“roadmaps” showing how CMOS technology is expected to evolve. These
roadmaps are based on observations of past industry trends (e.g., Moore's
Law) and an understanding of MOSFET scaling principles. The latest of
these roadmaps, ITRS’01, is summarized in Table 2.1 [4]. Historically,
these roadmaps have become obsolete almost as soon as they have been
produced, as the industry has often taken the roadmap values as targets that
must be exceeded in order to remain competitive. It remains to be seen
whether this will also be true of the most recent roadmap.
In addition to codifying expected scaling trends, the roadmap also
highlights the technical problems that must be solved to achieve these trends.
For example, huge lithographic problems must be overcome to be able to
pattern the expected future technology. It is hoped that lithography at 193
nm wavelength will carry the industry through about 2004, but then the next
shorter convenient wavelength (157 nm) will be needed, necessitating an
enormous development and installation investment, and yet will only carry
the industry a few more years. By 2007 an altogether new lithography will
be needed, such as EUV (extreme ultraviolet) or EPL (electron projection
lithography). Furthermore, it is anticipated that tricks can be found to shrink
transistor gate lengths far below even the extremely aggressive lithography
projections. Not surprisingly, one of the first “no known solution” problems
anticipated by the roadmap (in 2003!) is the difficulty in achieving sufficient
control of these gate lengths.

Even if lithographic issues are solved, the projected scaling of device
dimensions deep into the nanometer range poses many difficulties, some of
which will be highlighted in the next section. In particular, the statistical
uncertainties created by the discreteness of dopant atoms may well prevent
scaling of conventional MOSFETs below about 20 nm. The roadmap
anticipates that alternate device structures, such as some form of double-gate
MOSFET will probably be necessary to reach the outer years. Some of these
exploratory device concepts are discussed in Section 5. As indicated in the
table, it is essential that the effective gate insulator thickness decrease very
significantly to achieve these highly scaled devices. As will be shown in the
next section, this cannot be accomplished by simply thinning the oxide, since the
tunneling leakage current would be too high. Consequently, thicker gate
insulators with a higher dielectric constant than silicon dioxide are critical in
order to reduce the tunneling current while still yielding equivalent
thicknesses down to below 1 nm. Silicon oxynitrides are the first step in this
direction, and other materials with still higher k that can satisfy all the
reliability and interface requirements are under investigation.
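The benefit of a higher-k insulator can be illustrated with the standard equivalent-oxide-thickness relation. The k values used below are typical textbook numbers, not taken from this chapter:

```python
# Equivalent oxide thickness (EOT): a physically thicker high-k film can
# present the same gate capacitance as a much thinner SiO2 layer, which is
# how tunneling leakage can be reduced while still scaling EOT below 1 nm.

K_SIO2 = 3.9  # relative permittivity of SiO2

def eot_nm(physical_thickness_nm, k):
    """EOT = t_physical * (k_SiO2 / k_insulator)."""
    return physical_thickness_nm * K_SIO2 / k

# A hypothetical 3 nm high-k film (k ~ 20) behaves, capacitively, like
# sub-0.6 nm SiO2, while remaining far thicker as a tunneling barrier.
assert eot_nm(3.0, 20.0) < 0.6
# Pure SiO2 is its own reference: EOT equals physical thickness.
assert eot_nm(1.0, K_SIO2) == 1.0
```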
Another difficulty with the roadmap projections is the power dissipation,
which is increasing in future technology because of incomplete voltage
scaling and higher subthreshold currents associated with the lower V_t
needed to maintain performance. It will be shown in Section 4 that the
leakage dissipation leads to optimum scaling limits that vary depending on
application. This situation is partially captured in the roadmap projections in
the assumption that low power technology will lag several years behind high
performance technology. The higher currents associated with future
technology also lead to reliability problems. Electromigration-induced
defects in wiring are a serious issue that must be addressed, as are the gate
insulator defects that are triggered by the high tunneling currents through the
insulator [13]. In addition, the extremely high currents may inhibit the use
of burn-in (the stressing of chips at high voltage and temperature to
eliminate early fails).
Although the chip sizes are not expected to increase significantly, wafer
sizes are expected to increase in order to reduce manufacturing cost. 300
mm diameter Si wafers are expected to be in use in production within the
next few years, and still larger wafers (perhaps 450 mm) are being
considered for the future.

2.3 SCALING PRINCIPLES AND DIFFICULTIES

As has been mentioned, the continuing progress in CMOS technology is
based on the physics of scaling MOSFETs. These scaling principles were
originally developed in the early 1970s [14], and have been thoroughly
covered in many recent articles and reviews [15][6][7][8]. This section
provides a brief review of these principles and then goes on to discuss the
physical effects that stand in the way of continuing to apply these scaling
rules.
2.3.1 General Scaling

The basic idea of scaling is illustrated schematically in Figure 2.3, which
shows how a large FET can be scaled by a factor α to yield a smaller FET.
According to simple electrostatics, if the dimensions, dopings, and voltages
are scaled as shown, the electric field configuration in the scaled device will
be exactly the same as it was in the larger device. These scaling
relationships are summarized in the second column of Table 2.2. Note that
within this simple scheme, the speed increases by the factor α and the power
density remains constant.
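These constant-field rules can be checked with a small sketch. It is an idealization: delay is taken as CV/I and switching power density as IV per unit area, with all quantities normalized to the reference device:

```python
# Constant-field scaling sketch: an idealization in which delay ~ C*V/I and
# power density ~ I*V per unit area; all quantities are normalized.

def scale_constant_field(dev, alpha):
    """Scale a (normalized) device by a factor alpha."""
    return {"L": dev["L"] / alpha,   # dimensions shrink by alpha
            "V": dev["V"] / alpha,   # voltage scales with dimensions
            "C": dev["C"] / alpha,   # capacitance ~ dimension
            "I": dev["I"] / alpha}   # device current ~ dimension

def delay(d):
    return d["C"] * d["V"] / d["I"]          # gate delay ~ CV/I

def power_density(d):
    return d["I"] * d["V"] / d["L"] ** 2     # switching power per unit area

ref = {"L": 1.0, "V": 1.0, "C": 1.0, "I": 1.0}
scaled = scale_constant_field(ref, 2.0)

assert abs(delay(scaled) - delay(ref) / 2.0) < 1e-12           # speedup by alpha
assert abs(power_density(scaled) - power_density(ref)) < 1e-12  # unchanged
```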

Unfortunately, real scaling has not been so simple. Power supply
infrastructure (i.e., industry-wide standard supply voltages) and a reluctance
to give up the extra performance that can be obtained at higher voltage have
prevented voltage from scaling as fast as the dimensions. More recently, as
supply voltages have approached the 1 V level, it has become clear that there
are some difficulties with the theory too. In the first place, the built-in
potentials do not scale because they are tied to the Si bandgap energy, which
does not change (except by changing to a different semiconductor). This
problem is not too severe, and can be dealt with by increasing the doping. It
has also been suggested that the body can be forward biased to accomplish
much the same thing as a bandgap reduction [6]. A much more important
difficulty is that the subthreshold slope cannot be scaled (except by lowering
the temperature), since it is primarily determined by the thermodynamics of
the Boltzmann distribution. Because of this, the threshold voltage cannot be
scaled as far as the simple rules would demand or else leakage currents will
become excessive.
To accommodate the slower voltage scaling, an additional scaling factor ε
is introduced for the electric field (ε is greater than 1), as summarized
under "generalized scaling" in the third column of Table 2.2. Increasing the
electric field necessitates increasing the amount of doping and also increases
the power dissipation, but it does reduce the need to scale V_dd. The main
disadvantage of this form of scaling is the increased power, but another
problem is that the increasing electric field diminishes the long-term
reliability and durability of the FET. Indeed, this reliability concern forces
the use of lower supply voltages for smaller devices even when power
dissipation is not an issue [15].

The final form of scaling, “generalized selective scaling,” arises in recent
generations of technology where the gate length is scaled more than the
wiring. This is made possible by fabrication tricks such as over-etching the
gate, which enable sub-lithographic gate length while the wiring remains
constrained to the lithographic pitch. This approach is shown in the final
column of Table 2.2 and has two spatial dimension scaling parameters, α_d for
scaling the gate length and device vertical dimensions and α_w for scaling the
device width and the wiring. Since α_d > α_w, this approach allows gate delay
to scale faster than in the preceding cases.
issues related to them are described in more detail in [15].
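A back-of-the-envelope sketch shows why generalized scaling raises power density: dimensions shrink by a factor alpha while the electric field is allowed to grow by a factor epsilon greater than 1, so the voltage scales only as epsilon/alpha. The epsilon-squared trend is the point; the simple proportionalities for capacitance, frequency, and area are idealizations, not the chapter's exact table entries:

```python
# Generalized scaling sketch (idealized): with dimensions / alpha and
# electric field * eps, switching power density grows roughly as eps^2.

def power_density_ratio(alpha, eps):
    """Switching power density relative to the unscaled design."""
    c = 1.0 / alpha          # capacitance shrinks with dimensions
    v = eps / alpha          # voltage = (field) x (dimension)
    f = alpha                # circuit speed improves roughly by alpha
    area = 1.0 / alpha ** 2  # circuit area shrinks as alpha^2
    return (c * v ** 2 * f) / area   # algebraically equal to eps**2

# eps = 1 recovers constant-field scaling: power density unchanged.
assert abs(power_density_ratio(2.0, 1.0) - 1.0) < 1e-12
# eps > 1 (incomplete voltage scaling) raises power density as eps^2.
assert abs(power_density_ratio(2.0, 1.5) - 1.5 ** 2) < 1e-12
```

Note that the result is independent of alpha: the power-density penalty is set entirely by how far voltage scaling falls behind dimensional scaling.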
Following these rules, successive generations of technology have denser,
higher-performance circuits but with an increase in power density. The
limits of this scaling process are caused by various physical effects that do
not scale properly, including quantum mechanical tunnelling, the
discreteness of dopants, voltage-related effects such as subthreshold swing,
built-in voltage and minimum logic voltage swing, and application-
dependent power dissipation limits.

2.3.2 Characteristic Scale Length

Before exploring the effects that limit scaling, it is important to understand
how a good device is designed. The preceding scaling theory shows how to
scale a known good device design and make it smaller but not how to design
a good one in the first place. This leads to the question of what constitutes a
“good” MOSFET. Generally, this means that it has long-channel-like
behaviour, including
- High output resistance,
- High gain (in a circuit), and
- Low sensitivity to process variation.
These are the characteristics required to make robust circuits. In
addition, one would also like the FET to have certain short-channel
behaviors:
- High transconductance,
- High current drive, and
- High switching speed.
Since an FET cannot be both long and short, these two sets of desires are
in conflict, and the design of a MOSFET is a compromise. The gate must be
as short as possible, while still being long enough to have good control over
the channel. Two-dimensional (2D) effects occur when the channel becomes
short enough compared to its thickness that the drain potential can
significantly modulate the potential along the channel. When this happens,
the first set of behaviors is degraded. The extent of the 2D effects can be
well estimated by considering the ratio between the gate length, L, and the
electrostatic scale length λ for a given FET. This scale length is derived by
considering electrostatic solutions of the form sin(πy/λ)exp(πx/λ) for the
potential in the depletion and insulator regions of a MOSFET and applying
proper dielectric boundary conditions between the two [16]. It is given
implicitly as the largest solution of [17]

   ε_si tan(π t_i / λ) + ε_i tan(π t_d / λ) = 0

for bulk devices, where t_i is the physical thickness of the insulator, t_d is
the thickness of the depletion layer, ε_si is the permittivity of Si, and ε_i is
the permittivity of the gate insulator. This formula is valid for all
permittivities and thicknesses, but in the most common regime, where t_i is
small compared to λ, it can be approximately solved as [6]

   λ ≈ t_d + (ε_si / ε_i) t_i
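The scale length can be checked numerically. The sketch below assumes the implicit relation ε_si·tan(π·t_i/λ) + ε_i·tan(π·t_d/λ) = 0 with its largest root and the thin-insulator approximation λ ≈ t_d + (ε_si/ε_i)·t_i; the thickness values are illustrative choices, not taken from the text:

```python
import math

# Numerical check of the scale length (a sketch; the thicknesses below are
# illustrative). The largest root lies in (t_d, 2*t_d), where the left-hand
# side falls monotonically from positive values toward -infinity, so a
# simple bisection suffices.

def scale_length(t_i, t_d, eps_si=11.7, eps_i=3.9):
    """Largest root of eps_si*tan(pi*t_i/lam) + eps_i*tan(pi*t_d/lam) = 0."""
    def f(lam):
        return (eps_si * math.tan(math.pi * t_i / lam)
                + eps_i * math.tan(math.pi * t_d / lam))
    lo, hi = t_d * 1.0001, 2.0 * t_d * 0.9999   # f(lo) > 0, f(hi) < 0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)

t_i, t_d = 1.5, 15.0                       # nm; illustrative values
lam = scale_length(t_i, t_d)
approx = t_d + (11.7 / 3.9) * t_i          # thin-insulator approximation
assert t_d < lam < 2.0 * t_d
assert abs(lam - approx) / lam < 0.05      # approximation agrees within ~5%
```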

Figure 2.4 shows the dependence of various FET characteristics on the L/λ
ratio for an idealized bulk MOSFET without super-halo doping [6]. From this
analysis it appears that L/λ ≈ 2 would be a good nominal design point for
non-halo MOSFET technologies, since it allows room for tolerances of up to
±30% in the gate length at (approximately) the maximum tolerable variation.
To do better than this, halo or super-halo doping
profiles are required. By increasing the body doping near the source and
drain with suitable profiles, these doping techniques can cancel out some of
20 CMOS Device Technology Trends for Power-Constrained Applications

the threshold voltage roll-off. They can also improve the drain-induced
barrier-lowering (DIBL) curve by shifting the peak barrier in the channel
closer to the source, thus making the subthreshold current less sensitive to
drain voltage. As presently practiced, it appears that super-halo doping can
lower the design point down to about 1.5 while still maintaining ±30%
gate length tolerance.
Another way to get a smaller L/λ is to improve the processing so that
the required tolerance decreases. For example, if tolerance could be
improved to <±10%, it might be possible to reduce L/λ to < 1.0 [18].

2.3.3 Limits to Scaling

The effects that limit scaling can be broadly lumped into four categories:
quantum mechanical, atomistic, thermodynamic, and practical. There are
two types of quantum mechanical effects: confinement effects and
tunnelling effects. In conventional bulk and PD-SOI MOSFETs, quantum
confinement causes the average position of the channel carriers to be moved
a little farther from the interface. This weakens the effect of gate
insulator scaling by adding ~0.2 nm to EOT, but it is not a major concern. In
some of the novel device structures that are considered for the future,
however, confinement effects may play a more important role. At present,
quantum tunnelling of carriers through the energy barrier in the device is
generally a more important problem. This tunnelling results in leakage
current that increases power dissipation and decreases logic operating
margins.
Atomistic effects are due to the discreteness of matter. The primary
concern here is that there are a very small number of dopant atoms in a
highly scaled MOSFET, and statistical variations in the exact number of
dopant atoms can give rise to unacceptably large variations in terminal
characteristics. There may also be atomistic effects associated with
roughness scattering along the Si-insulator interface, but these have only
begun to be explored [19].
Thermodynamic effects are perhaps the most important, and take several
forms. First, the subthreshold behaviour of MOSFETs is governed by
Boltzmann statistics. Because of the thermal distribution of carriers, leakage
falls off only exponentially below V_t. The temperature of the carriers
determines the subthreshold slope and thus limits the scaling of the threshold
voltage, which cannot be scaled below some multiple of k_B T / q without
incurring excessive leakage current, where k_B is Boltzmann's constant and T
is the temperature. Since V_t is limited, supply voltage scaling is also limited.
In addition, the theoretical minimum supply voltage for self-consistent logic
is also determined by subthreshold slope. The second thermodynamic

consideration is that all of the energy used in computation is dissipated. It is


converted to heat that must be removed. Since conventional logic styles are
irreversible, all of the dynamic switching energy is dissipated every cycle.
This would not be a problem if the voltage could be fully scaled as in Table
2.2, but thermodynamics prevents that, as noted above. There exist
reversible computing schemes that can be implemented using CMOS
[20][21], but so far they have not proven practical enough for widespread
use. All of the leakage current in CMOS is also dissipative, consuming
additional power and generating more heat that must be removed.
The consumption of energy and removal of heat are not fundamental
limits but are constrained by practical limitations. Economic and
environmental considerations determine how much energy and heat a given
application can acceptably consume and generate. The consequences of
these constraints are application-dependent limits to scaling, which are
discussed in Section 4.

2.3.3.1 Tunnelling Through the Gate Insulator

There are three forms of tunnelling leakage of particular importance:


tunnelling through the gate insulator, band-to-band (Zener) tunnelling
between the body and drain, and direct source-to-drain tunnelling through
the channel barrier. Of these, quantum tunnelling through the insulator
between gate and channel is the most prominent and well known. Figure 2.5
shows the dependence of these currents on
voltage and oxide thickness. In nFETs this current arises from the tunnelling
of electrons from the channel into the gate. In pFETs the tunnelling current
can be caused by hole tunnelling from channel to gate for very thin
insulators (< 1.5 nm) and low voltages, but at higher bias it is usually due to
the tunnelling of electrons from the valence band of the gate into the
conduction band of the body. The processes differ because the valence band
barrier height is ~4.5 eV, while the conduction band barrier is only ~3.5 eV
(see Figure 2.6).
Using the simple WKB approximation [22], tunnelling current varies as

    J ∝ exp(−2 t_i √(2 m_eff m_0 q φ_b) / ħ) ≈ exp(−10.2 √(m_eff φ_b) t_i),

where t_i is the insulator thickness in nm, m_eff is the effective mass in the
barrier in units of the electron mass m_0, and φ_b is the bias-dependent
effective barrier height in eV's. Consequently, to limit gate tunnelling
current, one must set a

minimum bound on the insulator thickness. Since scaling still requires
increasing the capacitance per unit area, one is forced to consider changing
to higher permittivity gate insulators, and this is the focus of much current
research, since it appears to be the only way to reach the ITRS'01 targets. A
high-k gate insulator may be characterized by three thicknesses: its physical
thickness t_phys, its equivalent oxide tunnelling thickness t_tun, and its
equivalent oxide capacitive thickness t_cap. By definition, all three are
equal for SiO2. The goal is to find an insulator for which t_tun > t_cap.
This would enable further scaling, since one could decrease the capacitive
thickness without increasing the tunnelling current. At least initially, when
the gate insulator permittivity varies, the other device dimensions and
voltages can be scaled in keeping with t_cap, rather than the physical
thickness or the equivalent oxide tunnelling thickness, since this maintains
the scaling of charge density [6]. It should be noted that depletion of the
poly-Si gate also plays a role in these considerations, since it typically
increases t_cap by 0.4-0.5 nm. Consequently, the use of high-k dielectrics in
combination with metal gates (which have negligible depletion) is a
preferred scenario.
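The equivalent-oxide bookkeeping above is easy to make concrete. In the sketch below, the film thickness, the k = 20 value (an HfO2-class assumption), and the 0.45 nm poly-depletion penalty are illustrative numbers chosen within the 0.4-0.5 nm range quoted in the text.

```python
EPS_OX = 3.9  # relative permittivity of SiO2

def t_cap(t_phys_nm, k):
    """Equivalent oxide (capacitive) thickness of a high-k film:
    the SiO2 thickness giving the same capacitance per unit area."""
    return t_phys_nm * EPS_OX / k

# A hypothetical k = 20 film, 4 nm thick:
eot = t_cap(4.0, 20.0)      # electrically ~0.78 nm of equivalent oxide
eot_poly = eot + 0.45       # poly-Si gate depletion adds ~0.4-0.5 nm
eot_metal = eot             # a metal gate adds (nearly) nothing
```

The film is physically thick (4 nm, which suppresses tunnelling) but electrically thin (0.78 nm of equivalent oxide), which is exactly the t_tun > t_cap property being sought.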
The use of metal gates is complicated, however, not only by processing
difficulties, but also by workfunction considerations, since the V_t of a
MOSFET depends on the workfunction of the gate. Most potentially useable
metals have workfunctions that tend to align near the middle of the Si
bandgap, making them more useful for high-V_t devices, but not as good for

high-performance FETs. Metal gates also tend to suffer from


interface trapping and reliability concerns.
In addition to a high k, alternate gate insulators also need to present a
large barrier to the carriers in the Si so that tunnelling currents will not be
too large (Eq. 3). This means they need a large band gap and large band
offsets, or else they must be thick. Figure 2.6 shows the approximate
bandgaps and band offsets, as calculated by Robertson [24], for some of the
high-k materials that are being studied for possible use in CMOS. As can be
seen, there is a general trend for higher k materials to have lower bandgaps,
necessitating that these insulators be thicker than SiO2 in order to limit
tunnelling current. Potential replacements for SiO2 must also satisfy other
requirements, including thermal stability relative to Si at high processing
temperatures, low diffusion constants, thermal expansion matching, and low
interface trap densities. One problem with high-k insulators in
FETs is that they usually cause the channel mobility to be degraded [25]. It
is not yet clear how this can be overcome. Presently the only successful
high-k insulators are Si oxynitride composites, with k ~ 5-6, but HfO2
and ZrO2 are also considered promising [26].

2.3.3.2 Junction Tunnelling

The second important source of tunnelling leakage current is band-to-band


tunnelling between the body and drain of an FET, which occurs when the

FET is in the “off” state. Presently this leakage primarily occurs in the form
of indirect band-to-band tunnelling through defects and deep traps in the
depletion region, which often dominates over direct tunnelling, and is a
problem in DRAM and ultra-low power circuits in which even very tiny
currents are important. But since this current is strongly dependent on the
electric field [28], it is expected that it will become problematic even for
high performance logic when the body doping reaches the 10^19 cm^-3 regime.
Since direct band-to-band tunnelling depends on conduction band states
being lined up with valence band states, it can be avoided in bulk MOSFETs
when V_bs > V_ds, where V_ds is the drain-to-source voltage and V_bs is the
body-to-source voltage. Thus, tunnelling-free operation requires forward
body bias exceeding the supply voltage, V_bs > V_dd. At low temperature this
might be an interesting option [6], but it is unlikely that it would be applied
to anything except very high-performance computing.
Finally, it is possible for current to tunnel directly from source to drain
through the channel barrier. This effect has been studied both theoretically
and experimentally and has been observed for channel lengths below 20 nm,
especially at low temperature [29]. Most recent analyses show that such
tunnelling only becomes problematic at room temperature for channel
lengths below ~10 nm [5]. Since such short channel lengths will necessarily
be associated with very high performance, high power density applications,
this extra tunnelling current should be comparatively negligible in cases of
interest.

2.3.3.3 Discrete Doping Effects

The primary atomistic effect that may limit scaling is the discreteness of the
dopant atoms. The average concentration of doping is quite well controlled
by the usual ion implantation and annealing processes, but these processes
do not control the exact placement of each dopant. The resulting
randomness at the atomic scale causes spatial fluctuations in the local doping
concentration, resulting in device-to-device variation in MOSFET threshold
voltages. Within a few years it will be readily possible to make FETs whose
threshold voltages are controlled by fewer than 100 dopant atoms. The
uncertainty in the number of dopants, N, in any given device is expected to
vary as √N, in keeping with Poisson statistics, so that the fractional
uncertainty, 1/√N, and, hence, the threshold variation, may become quite
large, making the design of robust circuits very difficult. This is especially
true when one considers that the large number of devices on a chip creates a
statistical tail out to about 6 sigma. Since, by the same reasoning, the
threshold uncertainty varies as 1/√(WL), narrow devices are
most affected by this effect.
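A small Monte Carlo experiment illustrates the Poisson argument: with a mean of only 100 dopants, the device-to-device spread in the dopant count is about √100 = 10, a 10% fractional uncertainty, and a chip-scale 6-sigma tail spans ±60 dopants. The mean count and sample size below are illustrative assumptions.

```python
import math
import random

def poisson_sample(mean, rng):
    """Draw one Poisson-distributed dopant count (Knuth's multiplication method)."""
    limit = math.exp(-mean)
    k, p = 0, 1.0
    while p > limit:
        p *= rng.random()
        k += 1
    return k - 1

rng = random.Random(12345)
mean_dopants = 100.0
counts = [poisson_sample(mean_dopants, rng) for _ in range(5000)]

avg = sum(counts) / len(counts)
var = sum((c - avg) ** 2 for c in counts) / (len(counts) - 1)
sigma = math.sqrt(var)

fractional = sigma / avg        # ~ 1/sqrt(100) = 0.10
six_sigma_span = 6 * sigma      # ~ +/-60 dopants out in the chip-level tail
```

Doubling the channel area doubles the mean count and shrinks the fractional spread by √2, which is the 1/√(WL) dependence noted above.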

The effects of doping fluctuations on the V_t of MOSFETs have been


investigated by many workers. The most quantitatively accurate results use
randomly placed dopants in full 3D MOSFET simulations to fully resolve
the effects of dopant placement [33][32][31][30]. An example of such a
calculation is shown in Figure 2.7, which reveals the wide variation in
subthreshold behavior that is to be expected in an 11nm bulk MOSFET due
to random dopant placement. These 3D simulations were run using
FIELDAY [34] coupled with a pre-processor [31] to randomly place the
dopants. These dimensions represent the worst case (20% short) result for a
nominal 14 nm design point based on scaling from the published 25 nm
design of Taur, et al. [28]. Evidently, such a design point will be unusable
from a circuit point of view because of the very wide variation in threshold
voltage.

Based on such simulations, it appears that it will be very difficult to scale


conventional bulk MOSFETs below about 20 nm channel length and achieve
a reproducible, manufacturable process. One cannot predict this with
certainty, however, since there are several approaches to reducing this
problem, and more may be discovered. The most straightforward approach
for bulk devices is to move the dopants in the body back away from the

surface using highly retrograde channel doping profiles. Stochastic


simulations show that such profiles can yield up to 2 × lower uncertainty
than uniformly doped channels [30][31]. The uncertainty is lower because
the doping fluctuations are moved farther away from the channel and closer
to the body and so have less effect on the channel because they are screened
by the free carriers in the body. The best way to eliminate these fluctuations
is to remove the doping, and this may be possible in some novel FET
structures where the threshold may be set by the gate workfunction instead
of by doping [36][35].

2.3.3.4 Thermodynamic Effects

As was noted above, the subthreshold behavior of MOSFETs is controlled
by the Boltzmann distribution. As a result, V_t cannot be fully scaled, because
it is coupled to the off-current of the FET, and I_off is constrained by
application considerations. V_t is related to I_off by

    V_t = S log10(I_0 / I_off),

where S is the subthreshold swing and I_0 is the current at which V_t is
defined. Since S = n (k_B T / q) ln(10), where n is the ideality, the only way to
scale V_t without also changing I_off is to scale T. For high-end applications,
this is beginning to happen to some extent, through the use of chillers to
lower the junction temperature, but for most low power applications
significant cooling is not an option. There are two application
considerations that constrain I_off: it cannot be so high that the circuit doesn't
function, and the total power dissipation associated with the leakage must be
tolerable for the given application. The latter constraint is usually more
important because of the enormous device density on modern chips.
Because of these constraints, in low-to-moderate power applications the worst
case I_off may be in the 10^-8 to 10^-5 A/cm range, resulting in minimum V_t
between 0.54 and 0.27 V, respectively.
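The arithmetic behind this V_t range is easy to check. The numbers below are illustrative assumptions: a swing of about 90 mV/decade (ideality n = 1.5 at 300 K) and a reference current I_0 at which V_t is defined; under those assumptions, off-current budgets six and three decades below I_0 reproduce the 0.54 V and 0.27 V figures quoted above.

```python
import math

KT_Q_300K = 0.02585   # thermal voltage k_B*T/q at 300 K, in volts

def swing(n):
    """Subthreshold swing S = n*(k_B T/q)*ln(10), in V/decade."""
    return n * KT_Q_300K * math.log(10)

def vt_min(i_off, i_0, n=1.5):
    """Minimum threshold voltage implied by an off-current budget:
    V_t = S * log10(I_0 / I_off)."""
    return swing(n) * math.log10(i_0 / i_off)

I0 = 1e-2                       # assumed reference current (per unit width)
low_leak = vt_min(1e-8, I0)     # 6 decades of margin -> V_t ~ 0.54 V
high_leak = vt_min(1e-5, I0)    # 3 decades of margin -> V_t ~ 0.27 V
```

Each decade of tightening in the leakage budget costs one swing's worth (~90 mV here) of additional threshold voltage, and hence of gate overdrive.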
The thermodynamically determined subthreshold MOSFET behavior also
sets the minimum supply voltage for CMOS, independent of scaling. In
binary digital logic the fundamentally minimum permissible supply voltage
is the smallest voltage that is still large enough to maintain two distinct logic
states, and it was shown in the 1970s that this level is around a few k_B T / q [37].
Recently this estimate has been refined by analyzing the self-consistency
required for a combinatorial logic gate [6]. This self-consistency is based on
the observation that the two logic states are each identified with a relatively
narrow range of voltages, either high or low, and requires that a

combinatorial logic gate must be able to accept any possible combination of


inputs taken from these logic state ranges, and always produce an output
state that lies back in one of the logic state ranges.
The fundamental limits are found by considering conventional
series/parallel combinatorial circuits in the absence of noise and other logic

state degradation effects. The worst-case logic inputs must be identified, and
then the supply voltage must be adjusted so that even in the worst cases the
output state ranges are consistent with the inputs. Figure 2.8 shows an
example of using “eye” diagrams to determine the minimum fundamental
supply voltage for a simple CMOS 4-input NAND gate. Part (a) shows the
best- and worst-case bias conditions, which determine the upper and lower
bounds of the logic states. Figure 2.8(b) shows a case where V_dd is above the
minimum. The logic swing is "large," and the "eye" diagram shows a small
amount of noise margin between the lowest-switching gate with only one
input changing and the highest-switching gate with all of its inputs changing.
The two output state ranges are isolated and self-consistent, even though the
range of input states does create some spread.
When the logic swing is reduced too far (Figure 2.8(d)), the lowest and
highest curves no longer cross, indicating that there is no self-consistent
solution for the two logic levels. The lack of a self-consistent state means that
operating a long chain of such logic gates can result in the loss of the logic
signal [6]. Figure 2.8(c) shows the minimum logic swing condition: the
lowest and highest curves are exactly tangent at their intersection points (and
the noise margin is exactly zero).
Using this type of minimum logic swing condition, other logic families
and fan-ins have also been evaluated, and the minimum supply voltage is
found to vary roughly as ln(FI) for conventional devices in their
exponential regime, where FI is the fan-in. Since the lowest voltage results
occur for FETs in their subthreshold regime, where they present their
Power-constrained Scaling Limits 29

maximum, exponential nonlinearity, one would need devices with stronger
nonlinearities to achieve smaller minimum logic swing. Using MOSFETs in
the conventional above-threshold manner decreases their overall nonlinearity
and increases the required minimum supply voltage, as shown in Figure 2.9,
from 75 mV (FI=4) for pure subthreshold CMOS to 207 mV for fully
above-threshold operation at 300 K [6].
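The self-consistency argument can be reproduced with a toy model. The sketch below uses an ideal subthreshold inverter (exponential devices, ideality n = 1), finds its static transfer curve by current balance, and checks whether a long chain of such inverters preserves two distinct logic levels. At 300 K the two states survive at a 100 mV supply but collapse at 20 mV, consistent with a fundamental limit of a few k_B T/q; all device parameters here are illustrative assumptions, not the NAND analysis of [6].

```python
import math

VT_THERM = 0.02585   # k_B*T/q at 300 K (volts)

def inverter_vtc(vin, vdd, n=1.0):
    """Output voltage of an ideal subthreshold CMOS inverter, found by
    balancing the NMOS pull-down and PMOS pull-up currents (bisection)."""
    def imbalance(vout):
        i_n = math.exp(vin / (n * VT_THERM)) * (1 - math.exp(-vout / VT_THERM))
        i_p = math.exp((vdd - vin) / (n * VT_THERM)) * \
              (1 - math.exp(-(vdd - vout) / VT_THERM))
        return i_n - i_p           # increases monotonically with vout
    lo, hi = 0.0, vdd
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if imbalance(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def regenerates(vdd, stages=160):
    """True if a long inverter chain keeps two distinct logic levels."""
    v_from_low, v_from_high = 0.05 * vdd, 0.95 * vdd
    for _ in range(stages):
        v_from_low = inverter_vtc(v_from_low, vdd)
        v_from_high = inverter_vtc(v_from_high, vdd)
    # After an even number of stages, the two trajectories should still
    # sit on opposite logic levels if the chain is self-consistent.
    return abs(v_from_low - v_from_high) > 0.2 * vdd

ok_100mV = regenerates(0.100)   # bistable: logic levels preserved
ok_20mV = regenerates(0.020)    # below ~2*ln(2)*kT/q: levels collapse
```

Raising the fan-in or operating above threshold weakens the gate's nonlinearity, which is why those cases in the text require larger minimum supplies.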
This fundamental minimum supply voltage cannot be considered a
realistic design point because it neglects very important non-idealities, such
as noise, tolerances, and short-channel MOSFET behavior. In addition,
transient behavior would show increased timing variability (due to
dependence on the input state) near the minimum logic swing limit because
of the asymmetric switching.

2.4 POWER-CONSTRAINED SCALING LIMITS

2.4.1 Optimizing V_dd and V_t

If threshold voltage were scaled according to the constant field scaling rules
in Table 2.2, power density due to the dissipation of the dynamic switching
energy (for irreversible computation) would remain constant but power
density due to subthreshold leakage would rise exponentially. On the other
hand, if one halts the scaling of voltage to prevent increasing subthreshold
dissipation (in effect setting the voltage scaling factor to 1 in the generalized scaling rules), then the
power density associated with dynamic switching rises quadratically with
scaling. Consequently, if providing power or removing heat are costly or
inconvenient, then the thermodynamically determined subthreshold
MOSFET behavior forces one into an optimization situation. The optimum
V_dd and V_t for a given application need to be set so as to minimize the
power dissipation while providing the desired speed. This sort of
optimization has been well studied, especially in the low power regime
[41][39][40], where the effects of process and supply variations are quite
important. The results of a study by Frank, et al. [40] are shown in Figure
2.10 as an example. Each point in the figure represents an independent
optimization of both the supply voltage and the threshold voltage. These
results, which include realistic tolerances, illustrate the dependence of the
optimum design points on activity factor and logic depth. As can be seen,
the optimum voltage can readily drop below 1 V and can even approach 0.5
V under some circumstances. These particular optimizations are for
static CMOS arithmetic circuits, but the optimal voltages are not expected to
vary much as technology is scaled (assuming the delay target is also scaled).

Since the optimum voltages depend strongly on activity factor and logic
depth, a wide range of V_dd and V_t values are needed to satisfy the requirements
of a range of applications. Note that these supply voltages are much larger
than the theoretical minimum supply voltages for subthreshold logic
discussed in the previous section.
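A toy version of this optimization shows the trend. The sketch below uses an alpha-power delay model and a two-term power model (dynamic switching plus subthreshold leakage) with normalized, assumed coefficients; it is not the model of [40]. Grid-searching (V_dd, V_t) under a fixed delay target reproduces the qualitative result: low-activity circuits favor higher V_t (and hence higher V_dd), while highly active circuits favor lower voltages.

```python
def optimize(activity, s_swing=0.090, alpha=1.3, delay_target=2.0):
    """Minimize normalized power P = activity*Vdd^2 + Vdd*10**(-Vt/S)
    subject to a normalized delay constraint Vdd/(Vdd-Vt)**alpha <= target.
    Returns (best_vt, best_vdd). All coefficients are illustrative."""
    best = None
    vt = 0.05
    while vt <= 0.50:
        vdd = vt + 0.05
        while vdd <= 1.20:
            delay = vdd / (vdd - vt) ** alpha
            if delay <= delay_target:
                power = activity * vdd ** 2 + vdd * 10 ** (-vt / s_swing)
                if best is None or power < best[0]:
                    best = (power, vt, vdd)
            vdd += 0.005
        vt += 0.01
    return best[1], best[2]

vt_lo, vdd_lo = optimize(activity=0.001)   # mostly idle logic
vt_hi, vdd_hi = optimize(activity=0.5)     # very active logic
# Low activity pushes the optimum toward higher Vt (leakage-limited);
# high activity pushes toward lower Vdd and Vt (switching-limited).
```

The optimum always sits on the delay-constraint boundary: any slack in speed is immediately traded for lower voltage and hence lower power.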

2.4.2 Optimizing Gate Insulator Thickness and Gate Length - the Optimal End to Scaling

The preceding optimization example assumed a fixed technology generation


in which only the voltages V_dd and V_t could be varied. It did not take tunnelling through the
gate insulator into account and did not look at optimizing the complete
scaling of the transistors. In a world in which applications are up against the
practical limits to scaling associated with power generation and heat
removal, the next logical step is to extend the optimization to include the full
optimization of the transistor design. That such optima exist for power-
constrained applications can be seen by observing that if FETs are scaled too
far, the tunnelling leakage currents will increase, consuming power that
would otherwise be available for dynamic switching, thus lowering the
system performance. On the other hand, if they are not scaled far enough,
the switching speed will be slower, again lowering system performance.
Evidently, one should optimize the technology for a given application to

maximize the system performance, subject to that application’s particular


power constraints.
Such optimizations have not yet been carried out, but estimates of where
these optima lie have been made [17][6] based on reasonable estimates of
how the power should be divided between dynamic switching and leakage
mechanisms. The estimated scaling optima from [17] are presented in Table
2.3 for both bulk-like MOSFETs and a novel device structure, the double-
gate MOSFET (DG-FET), which will be discussed in Section 5. This table
is intended to show the general trends and dependencies of these limits,
rather than exact values. As can be seen, these limits depend significantly on
application, since different applications can tolerate different amounts of
static leakage power, so that there is no single end to scaling for a given
device technology, but rather there are different optimum ends to scaling for
different applications. High power, high performance servers can accept
much higher static leakage dissipation than portable battery-powered
devices, and so the former can be more aggressively scaled than the latter.
In creating these estimates, total power density is the overriding
parameter, and the leakage mechanisms are each allocated a certain fraction
of the total. Although complete optimizations will change the fractions
somewhat, the results in the table are only logarithmically dependent on
these values, so the conclusions should not change greatly. These estimates
also assume that the packing density of FETs is tied to their gate length.
This is more aggressive than the ITRS roadmap, which calls for the FETs to
scale more rapidly than the packing. If FETs are packed less densely, they
can be scaled further, but delay does not benefit as much as it would if the
wire were also scaled.
There are two types of circuit application in this table: for SRAM cells it is
assumed that essentially all of the power is static (i.e., very little activity)
and for logic circuits it is assumed that the switching activity is at least a few
percent and that static power is ~1/3 of the total power. The latter case
implicitly assumes that quiescent power dissipation requirements during
periods of long inactivity will be met by switching off the power supply. If
that is impossible for some applications, they will need higher thresholds,
thicker insulators, and less aggressive doping than their active power limits
would permit. Although multiple technologies will undoubtedly be present
on the same chip, the table is thought of as identifying the requirements for
the dominant technology on a given section of a chip. See the next section
for more discussion of this issue. The supply voltages are chosen on the
assumption that the circuits must be at least moderately fast with moderate
switching factors and good margins.
The methodology used to create Table 2.3 is described in detail in
[17] [6], but the basic idea is as follows. Starting with an approximate

channel length, the fraction of the power allocated to subthreshold
dissipation is used to determine V_t. The fraction allocated to gate current is
used to calculate the insulator thickness t_i (an oxynitride gate insulator
is assumed here). The fraction allocated to band-to-band tunnelling
(together with the V_t in the case of DG-FETs) is used to determine the
depletion depth t_d. Given t_i and t_d, the scale length λ is computed, from which a more
accurate estimate of the nominal channel length is determined. This


procedure is iterated until converged. It is assumed that the circuit is
clocked fast enough to use up the fraction of the power that is allocated to
dynamic power. The optimizations in the table do not include discrete
doping effects, which may very well thwart the scaling of bulk MOSFETs
below 20 nm.
It can be seen very clearly from Table 2.3 that the scaling limits for
conventional planar CMOS depend on application power requirements.
Going from high power to low power applications, the decreasing leakage
requirements cause the optimum nominal channel length for bulk MOSFETs
to increase 3×, from ~13 nm to ~39 nm, while t_i increases from 0.9 nm
to 2.6 nm, which corresponds to tunnelling current densities (at 1 V) falling
by many orders of magnitude from 15 kA/cm².
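The steepness of that current-density trade-off follows from the WKB exponent of Section 2.3.3.1. The barrier parameters below (m_eff = 0.5, φ_b = 3.1 eV, a pure-SiO2-like electron barrier) are assumptions for illustration; real oxynitrides differ somewhat.

```python
import math

def wkb_decades_per_nm(m_eff, phi_b_ev):
    """Decades of tunnelling-current reduction per nm of added insulator,
    from the simple WKB estimate J ~ exp(-10.2*sqrt(m_eff*phi_b)*t_nm)."""
    return 10.2 * math.sqrt(m_eff * phi_b_ev) / math.log(10)

slope = wkb_decades_per_nm(0.5, 3.1)   # roughly 5-6 decades per nm
decades = slope * (2.6 - 0.9)          # span of the thickness range above
```

Under these assumptions the 0.9 nm to 2.6 nm span corresponds to roughly nine decades of gate current, which is why the low-power and high-power optima differ so strongly.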

2.4.3 Discussion of the Optimizations

There are many aspects to this optimization analysis that deserve comment,
but only a few can be discussed here: the question of whether the power
targets are achievable, considerations involved in mixing high performance
FETs with lower performance FETs, and some comments about the
uncertainties of the calculations. For discussion of various other issues, see
[6] and [17].
Most of the optimizations in Table 2.3 assume that 60-70% of the power
is dissipated by the switching activity of the circuitry. At the very highest
power densities, this requires extremely active, heavily loaded
circuits, such as clock drivers, data bus drivers, or off-chip I/O drivers.
Random logic circuits made from these most-scaled FETs would be unlikely
to use so much dynamic power, so it is likely that the fraction of static power
in such circuits would be much higher.
Moving down the power scale to less aggressive technology, it should be
relatively easy for even low power technology to reach the necessary active
power densities. For circuits at the low power end of the design
space, the challenge is to get the active power down to the required levels.
This is primarily a matter of circuit and system design and is largely the
subject of this book. Some of the more obvious approaches include the
following. (1) Since chips consist of a mixture of circuit blocks with varying
activity, one can average the more active circuits over the less active areas
and over large areas of lower dissipation SRAM or DRAM, thus reducing
the overall power density by as much as an order of magnitude. (2) The clock
frequency can be reduced to just barely meet the throughput requirements,
which may enable a further reduction in V_dd, although V_dd cannot be too
close to V_t or else threshold variations will cause too much timing


uncertainty. (3) The chip can be run in bursts of power-optimized activity
and turned off between bursts. (4) The chip can be designed as a collection
of many special purpose macros, each power- or energy-optimized for its
own specific task. The processor would shuffle the work among the macros,
minimizing the energy consumed and increasing the averaging used in (1)
[6].
It is expected that most chips will be designed to use a mixture of
technologies to meet the varying needs of the system. At present the
available mixture usually consists of devices with a few different threshold
voltages but the same gate insulator thickness.
manufacture multiple insulator thicknesses on the same wafer, then it should
become possible to have a more interesting mix of FET technologies on the
same chip. High-V_t, less-scaled devices will be used for the low activity
SRAM cells, while more highly scaled, low-V_t devices will be used in critical
logic paths. To reasonably balance the power usage, the fraction of high
power logic devices ought to vary roughly as P_dom/P_high, where P_dom is the
power density of the dominant device technology and P_high is the power
density of the high power logic devices. If the power is balanced in this
way, the total system power should vary roughly as P_dom (1 + ln(P_max/P_dom)),
where P_max is the power density of the highest power technology used [17]. An
example of this would be a high performance processor in which 70% of the
area is SRAM cells, 20% is the dominant logic technology, and the remaining
area (7%, 2%, 0.7%, and 0.3%) is divided among successively higher power
technologies. As a whole, this processor would dissipate a power density that
can be cooled using reasonable technology. The practical economic question is whether
the cost of integrating all of these different technologies onto the same chip
will outweigh the benefits. This example also raises the issue of hot spots,
but the thermal analysis in [17] shows that the high thermal conductivity of
Si enables reasonable design constraints to successfully control this problem.
The accuracy of the scaling limit optimizations in Table 2.3 rests mostly
on the leakage current models, since the leakage currents are exponentially
dependent on the voltages and/or dimensions. On the other hand, this means
that the optimized voltages and dimensions are only (approximately)
logarithmically dependent on changes in the model parameters or the power
allocations. This insensitivity to assumptions can be seen in the observation
that dimensions only vary about a factor of three when the power density
constraint varies by six orders of magnitude. The threshold voltages should
be reasonably accurate, since they depend only on the I_off definition itself and
the thermodynamic relationship of the subthreshold current to V_t and the
ideality. The preferred relationship between t_i and the tunnelling current
density varies a few Angstroms from one lab to another, so the current
Exploratory Technology 35

density should be considered the more fundamental parameter that is being


optimised here. Finally, the depletion depth determination is a little
uncertain because it is based on a band-to-band tunnelling model for which
there is relatively little experimental data. This area deserves much further
investigation because it may play a prominent role in the end of scaling.
Nevertheless, the results are not too uncertain, because even if band-to-band
tunnelling were entirely removed as a mechanism, one would still end up
with essentially the same optima by using the ideality factor as an
optimization variable, since the depletion depth can be determined from
the ideality n and the insulator thickness t_i.
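The last remark follows from the standard capacitive-divider model of the subthreshold ideality, n = 1 + C_d/C_i, which is textbook device physics rather than anything specific to [17]; the example values below are illustrative.

```python
EPS_SI, EPS_OX = 11.7, 3.9   # relative permittivities of Si and SiO2

def depletion_depth(n, t_i_nm, eps_i=EPS_OX):
    """Invert n = 1 + C_d/C_i, with C_d = eps_Si/t_d and C_i = eps_i/t_i:
    t_d = (eps_Si/eps_i) * t_i / (n - 1)."""
    return (EPS_SI / eps_i) * t_i_nm / (n - 1.0)

# Example: an ideality of 1.3 with a 1.5 nm oxide implies ~15 nm of depletion depth.
t_d = depletion_depth(1.3, 1.5)
```

So fixing the ideality in the optimization pins down t_d once t_i is known, which is why removing the band-to-band tunnelling constraint barely changes the optima.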

2.5 EXPLORATORY TECHNOLOGY

Since conventional CMOS technology appears to be nearing the limits to


which it can be scaled, a variety of novel device ideas are being actively
explored as possible ways to go beyond those limits. Some of these ideas
represent radical changes in device structure, while others involve changing
the materials and processing. One of the more straightforward ideas is the
use of high-k gate dielectrics. As discussed in Section 3.3.1, this is essential
to achieving the effective oxide thicknesses needed to stay on the ITRS
projections, but the materials requirements appear very difficult to meet.
Some of the other exploratory ideas are discussed in the following sections.

2.5.1 Body- or Back-Gate Bias

Perhaps the simplest concept being investigated is the use of triple-well


technology to enable the application of separate body biases to different
blocks of nFETs and pFETs. The use of adjustable body bias is potentially
very useful for low power technology, since it allows the threshold voltages
to be adjusted after manufacture. It can be used in a feedback loop to
optimize threshold voltages for low supply voltages [42]. It can be used to
remove chip-to-chip and run-to-run variation from the threshold distribution,
thus improving chip yield [44][43][67]. It can be used on a block-by-block
basis to reduce the leakage current of blocks that have been turned off [45].
Triple-well technology scarcely deserves to be called exploratory, since it is
already commonly used in the memory array areas of SRAM and DRAM
chips [46]. The novelty is that it may be offered for use in the logic areas
too. This technique may require some modified ground rules to prevent
latch-up, and it would probably be impractical to apply it on a very fine-
grained scale (e.g., on a device-by-device basis) because the deep n-well

used for the third well has a substantial lateral spread, but it may well be
very useful on a macro-to-macro scale.

2.5.2 Strained Si

The strained Si MOSFET is a fairly straightforward variation on


conventional bulk Si. The structure is illustrated schematically in Figure
2.11 (a), from which it can be seen to be very similar to a conventional
MOSFET. The difference is that biaxially tensile strained Si is used in the
channel. For the conduction band of <100> Si, this type of strain causes the
6-fold valley degeneracy of Si to split into two lower energy valleys and four
higher valleys. The lower energy valleys have low effective mass in the in-
plane transport direction, which significantly increases the mobility for
electron transport in the plane compared to what it is in bulk Si. The strain
also splits the degeneracy of the valence band at the Γ point, yielding a higher in-plane mobility for holes, too. For the inversion layer of n-channel MOSFETs, mobility improvements in excess of 70% have been observed for strained Si, as shown in Figure 2.12(a). This higher mobility translates into
higher drive current, as shown in the experimental IV curves in Figure 2.12
(b) [48][47], and this is, of course, the entire motivation for using strained Si.
Higher drive current translates into either faster circuits at the same voltage,
or lower voltage and lower power circuits at the same speed.
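The speed/voltage trade can be made concrete by asking how far the supply could drop at constant gate delay if drive current scales with mobility. The square-law drive model and all numbers below are assumptions for this sketch, not data from the text; velocity saturation in real short-channel devices reduces the realized gain.

```python
def rel_delay(mu, vdd, vt=0.35):
    """Relative CV/I gate delay with a simple square-law drive model:
    I ~ mu*(vdd - vt)^2, delay ~ vdd/I (load C held fixed)."""
    return vdd / (mu * (vdd - vt) ** 2)

base = rel_delay(1.0, 1.2)          # unstrained device at 1.2 V

# Bisect for the Vdd at which a 1.7x-mobility device matches `base`
# (rel_delay is monotone decreasing in vdd on this interval).
lo, hi = 0.5, 1.2
for _ in range(60):
    mid = (lo + hi) / 2
    if rel_delay(1.7, mid) > base:  # still too slow -> raise Vdd
        lo = mid
    else:
        hi = mid
vdd_low = (lo + hi) / 2             # ~0.92 V
power_ratio = (vdd_low / 1.2) ** 2  # dynamic power ~ C*V^2*f at fixed f
```

Under these assumptions a 70% mobility gain buys either a 1.7x speedup at 1.2 V or roughly 40% lower dynamic power at the original speed.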

The straining of the Si layer is accomplished by epitaxially growing it on an unstrained SiGe layer. The strain of the Si layer increases in proportion
to the Ge mole fraction, which may be 10-35%. It is important that the Si
layer be kept quite thin (typically <20 nm) to prevent the development of
dislocations, which would be harmful to device characteristics. The SiGe
layer is grown epitaxially on the Si substrate, but the Ge composition is
graded and the total thickness is large enough that the strain is entirely
relaxed, and the dislocations are (hopefully) confined deep within the layer
where they will not harm the device. There are many material issues
involved in successfully growing and processing these various layers, and
much work still to be done before these MOSFETs can replace conventional
technology, but the results so far are encouraging [27].
It would be desirable to extend the strained Si concept to SOI, but this is
fairly difficult because of the thick SiGe layer that is used to strain the Si
layer. There are, nevertheless, several concepts being studied, including
wafer bonding/layer transfer and oxygen implantation into the SiGe
(SIMOX-like) [27].

2.5.3 Fully-Depleted SOI

The next more complex exploratory device structure is the fully-depleted
SOI (FD-SOI) MOSFET, which is illustrated in Figure 2.11(b). When
compared to Figure 2.2, one can see that FD-SOI is very similar to PD-SOI
except that the Si layer is much thinner. Typically, the Si layer should be
less than about half the depletion depth of a corresponding bulk device, to
guarantee that the layer remains fully depleted over the full range of gate
voltage. Under these circumstances, the floating-body effect of PD-SOI is
almost entirely eliminated except at very high drain voltages [8].
FD-SOI has long been studied because of its potential advantages over
bulk technology [49]. Various investigators have shown, however, that FD-
SOI has fairly poor scaling characteristics because there are no carriers or
conductors on the back side to screen the drain electric field [51] [50].
Recent simulations also indicate that it has worse short channel effects than
double-gate MOSFETs with the same Si thickness, as shown in Figure 2.13.
To achieve the same roll-off characteristics as DG-FETs, the FD-SOI Si
layers must be reduced to less than half the thickness of the DG-FET layers.
Nevertheless, FD-SOI does have some advantages compared to PD-SOI.
Floating body effects are eliminated, making circuit design easier. Parasitic
drain capacitance is reduced because the depth of the drain-to-body junction
is greatly reduced. The subthreshold slope is improved, making it possible
to scale the threshold and supply voltages further. For example, recent experiments on 50 nm gate
length FD-SOI devices showed a subthreshold swing of 75 mV/decade (versus 85-90 mV/decade for bulk control samples) [53].
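The value of a steeper swing can be quantified with the standard exponential subthreshold model; the prefactor and threshold voltage below are assumed for illustration, not measured values from [53].

```python
def i_off(vt, swing_mv, i0=1e-7):
    """Subthreshold leakage at Vgs = 0 (A/um, illustrative prefactor):
    i0 * 10^(-Vt in mV / swing in mV/decade)."""
    return i0 * 10 ** (-vt * 1000.0 / swing_mv)

# Same Vt = 0.3 V: 75 vs 87.5 mV/decade gives ~3.7x less leakage.
leak_ratio = i_off(0.3, 75) / i_off(0.3, 87.5)

# Or hold leakage fixed and spend the swing on a lower threshold:
vt_fd = 0.3 * 75 / 87.5   # ~0.257 V, freeing ~43 mV of voltage headroom
```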
Some of the problems associated with FD-SOI include the difficulty of
controllably creating such thin layers of Si and difficulty in controlling the
threshold voltage because it is now sensitive to the thickness of the Si layer
and to possible trapped charge on the back interface. It is also difficult to
achieve low resistance source and drain contacts to such thin Si layers, but it
may be possible to overcome this problem through a raised source/drain
process, as shown in the excellent results of [53]. One other problem is self-
heating. Because the buried oxide has very low thermal conductivity, heat generated in
the drain of the FET cannot be removed very quickly, sometimes causing the
device to become quite hot. At high temperature, the mobility, the drive
current, and the performance are all degraded.
A major variation on the FD-SOI process is the inclusion of a second
gate below the device layer that serves as a ground plane with regard to
electrostatic properties. This is essentially an asymmetric form of the planar
double-gate MOSFET discussed below. To be effective, the back-gate must
not be too far from the device, so the back gate insulator must be fairly thin
(~3-10 nm). Experiments on such devices have shown a threshold-voltage tunability range
of up to ~1 V [54]. Such a back-gate might be interesting for low power use
because its close proximity to the device might allow much finer-grained
control than is possible with the deep n-well in triple-well bulk technology.

2.5.4 Double-gate FET Structures

As alluded to above, a double-gate FET is a MOSFET with two gates, one above and one below the channel [36][52][55][56], as shown schematically
in Figure 2.11(c). This device is the focus of much current research, since it
is the structure that seems most promising for exceeding the limits of
conventional bulk scaling. Unfortunately, it is also a very difficult structure
to make, which has slowed progress.
The basic idea is that a second gate screens the drain field just as well as
the body of a bulk MOSFET, if not better, thus keeping short channel effects
under control, as shown in Figure 2.13. But unlike the body of a bulk
device, the second gate can be switched in conjunction with the first gate to
effectively double the switching current of the FET under many
circumstances. The prospect for a better scaling limit than bulk devices
derives from the observation that eventually it ought to be possible to make
the Si layer thickness much thinner than the depletion layer of a bulk device,
and this enables the gate length to be scaled smaller, too. One early estimate
put this scaling limit at 30 nm, based on 3 nm oxide thicknesses [36], but
more recent theoretical analyses have been much more aggressive, showing
that high quality FET characteristics can probably be obtained even for
devices with gate length below 10 nm [5][57].
Although many asymmetric forms of the DG-FET have been considered,
involving different gate insulator thicknesses and different gate
workfunctions [58], the highest performance DG-FETs are symmetric, with
the same insulator and same gate workfunction on both sides [52]. For these
symmetric devices one can follow a derivation similar to that of Eq. 1,
yielding a scale length that is the largest solution of a transcendental equation given in [17].

If both gates of a symmetric DG-FET are switched together, there is no capacitor divider effect (like that in bulk devices between gate and body), the ideality factor is near unity, and the subthreshold swing can be nearly ideal,
perhaps <70 mV/decade at room temperature. This should be very useful for
low power operation, since it will allow the threshold and supply voltages to be scaled further.
Another potential scaling advantage of DG-FETs is that it may not be
necessary to dope the Si channel if suitable gate material can be found, since
the workfunction of the gates could be used to set the threshold voltage. In this case,
discrete doping fluctuations could be avoided, and scaling could proceed
further [36]. The threshold does, however, become quite sensitive to the
thickness of the body when it is scaled below ~5 nm, because the quantization energy associated with confining an electron to such a thin layer increases as the inverse square of the thickness. Thus, tolerance concerns may constrain the body thickness to about 5 nm [52].
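That sensitivity can be checked with a particle-in-a-box estimate: the ground-state energy of an infinite square well scales as the inverse square of its width, so its derivative with respect to thickness grows as the inverse cube. The confinement mass and thicknesses below are assumptions for illustration only.

```python
import math

HBAR = 1.0546e-34   # reduced Planck constant, J*s
M0   = 9.109e-31    # electron rest mass, kg
Q    = 1.602e-19    # electron charge, C

def e0_mev(t_si_nm, m_rel=0.92):
    """Ground-state energy (meV) of an infinite square well of width
    t_si_nm; m_rel ~ 0.92 is the assumed confinement mass (in units
    of m0) for the lowest valleys of (100) Si."""
    t = t_si_nm * 1e-9
    return HBAR**2 * math.pi**2 / (2 * m_rel * M0 * t * t) / Q * 1e3

# |dE0/dt| = 2*E0/t: threshold sensitivity to body thickness, mV/nm.
sens_5nm = 2 * e0_mev(5.0) / 5.0   # ~6.5 mV per nm of body thickness
sens_3nm = 2 * e0_mev(3.0) / 3.0   # ~30 mV per nm: atomic-scale
                                   # thickness variation already matters
```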
Like FD-SOI, floating body effects are not expected to be present in DG-
FETs. Nevertheless, direct tunnelling from the drain into the opposite
energy band of the channel may occur in an “off” device, and this would
contribute to leakage power dissipation. Since direct band-to-band
tunnelling depends on conduction band states being lined up with valence
band states, it can be avoided in undoped-channel DG-FETs if the supply voltage is kept below the silicon bandgap voltage (~1.1 V) [6].
In a general geometric sense there are three forms for the DG-FET,
depending on the plane and direction of current flow relative to the substrate:
planar, vertical, and fin-like, as shown in Figure 2.14, where they are labeled
I, II, and III, respectively. These three differ in how the channel length,
width and thickness dimensions are controlled, as indicated in the chart in
Figure 2.14 [8]. Planar (Type I) devices seem to offer the best possibility of
process control, since the smallest dimension, the thickness, is controlled as
a layer thickness, by polishing or perhaps growth, while the other two
dimensions are defined by lithography, as in bulk technology. Several
successful experiments have been performed to build planar DG-FETs
[58][59][68][69], but the fabrication processes for creating a lower gate properly aligned to the upper gate have proven quite difficult. In addition,
the same thin Si layer problems exist here as in FD-SOI.
Vertical DG-FETs (Type II) have also been built experimentally, with
reasonable success [60][61]. The difficulty here is that the smallest dimension, the body thickness, must be defined lithographically. It is also difficult to
fabricate the drain or source contact on top of the narrow “wall”. On the
positive side, the channel length can be defined quite accurately by a layer
thickness, which would tend to limit one of the major sources of variation in
conventional devices.
Experimentally, it turns out that FinFETs (Type III DG-FETs) have been the easiest to make, at least for exploratory purposes [62][63][64][65]. The fins are made by etching an SOI layer so that the device width is determined by the layer thickness while the fin thickness and the gate length are determined by lithographic processing. For manufacturing purposes it is not yet clear whether or not the fin thickness can be adequately controlled in
this manner, but for use in research these devices are fairly easy to build.
Using conventional FET layout, the active area mask is modified in the gate
area in order to create many parallel minimum feature size lines running
from source to drain. After etching, these become the fins through which the
current flows. The gate insulator is then grown and the gate is deposited
conformally, after which it is patterned lithographically and etched, yielding
a gate that wraps around both sides of the fin, see Figure 2.15. If the height
of the fins exceeds half their pitch, these devices should have more drive
current per unit layout width than conventional MOSFETs. Figure 2.16
plots recent electrical results, showing very high current drive for 30 nm
channel-length FinFETs [64].
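The layout-width argument above reduces to simple geometry; the fin dimensions used here are illustrative assumptions.

```python
def drive_per_layout_width(fin_height_nm, fin_pitch_nm):
    """Conducting width per unit layout width for double-gate fins:
    each fin on pitch p contributes two sidewalls of height h."""
    return 2.0 * fin_height_nm / fin_pitch_nm

# Break-even at h = p/2; taller fins win.
assert drive_per_layout_width(50, 100) == 1.0
gain = drive_per_layout_width(65, 100)   # 1.3x a planar device
```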

Since DG-FETs appear to be promising alternatives to bulk MOSFETs, they were also included in the optimization procedures described in Section
4. According to the results in Table 2.3, they show up to a 30% scaling
advantage over bulk. The largest advantage occurs for intermediate power
levels and low where it is equivalent to an entire generation of scaling,
but the advantage is lost for the high high cases because the DG-FET
is more impacted by body-to-drain tunnelling, though this may be partly an
artifact of the model. The DG-FET’s advantage is also reduced at high
power densities, partly because there is more gate insulator area per of
Si (for the assumed geometry) and so the insulator tunnelling power
constraint necessitates slightly thicker (~0.1 nm) gate insulators, and partly
because of the 4 nm minimum that was imposed for the sake of tolerance
control. A further consideration is that the optimizations assumed a planar
DG-FET geometry, but if a FinFET geometry were assumed for the DG-
FET, the gate insulator area per unit area of Si could be even more than twice that
of bulk CMOS, requiring a still thicker gate insulator to hold tunnelling
dissipation in check. Since DG-FET gate capacitance per unit area of substrate
may also be at least twice that of bulk, the constraint on dynamic power
density forces the use of lower clock frequency, narrower devices and lower
supply voltages (which should be possible because of the improved
subthreshold slope). Narrower devices result in tighter logic gate pitch,
shorter interconnects and lower wiring capacitance, all of which are useful
for low energy computing [17].
Will some type of double-gate device eventually supplant bulk CMOS?
It is difficult to predict at this point in time. Discrete doping issues may very
well prohibit bulk designs below 20 nm, which greatly increases the DG-
FETs’ advantage. On the other hand, DG-FET design points below 20 nm
probably require halo-like roll-off compensation and metal gates with suitable workfunctions to set the threshold voltage, neither of which are known processes.
Furthermore, DG-FET currents are likely to be degraded because most
geometries are expected to suffer from self-heating effects, like other SOI
devices.

2.5.5 Low Temperature Operation for High Performance

There is one more technology option that should be considered for high
performance computing. This is the possibility of running high performance
processors at low temperatures, perhaps 100-150 K. This option does not
require significant device modifications, yet it addresses many of the issues
that limit conventional scaling. First, the threshold voltages should be able
to scale with the operating temperature T, since the subthreshold swing
scales with T, and according to Eq. 4, this would keep the off-state leakage current constant. As a result, the supply voltages can also scale. Following this type of scaling, dynamic power dissipation varies as T^2, while the energy required for ideal refrigeration only varies as Ta/T, where Ta is the temperature at which heat leaves the system, e.g., ~350 K. Thus, even taking into account the
inefficiency of real heat pumps, it should be possible to break even on the
total room temperature power dissipation.
Furthermore, low temperature improves the mobility of the transistors
and lowers the resistance of the wires, both of which increase performance.
The use of lower voltage supplies would also lower the tunnelling current
through the gate insulator significantly, which would enable further scaling.
Another advantage would be that the reliability of the circuits, which is a
great concern for future technologies, would be greatly enhanced at low
temperature, since most failure mechanisms are at least partially thermally
activated and would therefore be highly suppressed. At low temperature,
DRAM retention time would probably increase so much that it could be
treated as non-volatile, possibly enabling different types of memory design.
Finally, for bulk devices low temperature might enable the use of forward
body bias, which could lower the transverse field, improving the mobility,
and shrink the depletion depth, enabling further scaling.
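The break-even argument can be sketched numerically. Dynamic power is taken to scale as T^2 (supply and threshold tracking T), and the cooler is modeled as achieving a stated fraction of the Carnot limit; all values here are assumptions for illustration.

```python
def total_power_ratio(t_cold, t_amb=350.0, carnot_frac=0.3):
    """Total heat rejected at t_amb (chip power plus refrigerator work),
    relative to running the same chip at t_amb.  Chip power ~ T^2;
    the ideal cooling work per joule removed is (t_amb/t_cold - 1),
    divided by the fraction of the Carnot limit actually achieved."""
    chip = (t_cold / t_amb) ** 2
    work = chip * (t_amb / t_cold - 1.0) / carnot_frac
    return chip + work

ratio_125k = total_power_ratio(125.0)   # ~0.89: below break-even even
                                        # with a 30%-of-Carnot cooler
```

On top of this break-even, the mobility and wire-resistance improvements mean the cooled chip is also faster, which is why the option targets high performance rather than low power.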
Low temperature operation will probably not help low power applications, but for all of the above reasons it seems very promising for
high performance use.

2.6 SUMMARY

CMOS technology has made phenomenal progress in the past 30 years: it is now possible to put one billion bits in a single DRAM chip and 100 million
transistors in a single chip processor. To maintain its current economic
model, the industry would very much like this trend to continue and has
created the ITRS roadmaps to help bring this to pass. Even the present
CMOS technology is very powerful for creating a wide range of
applications, from portable low power personal digital assistants to high-end
servers, and the future technology is expected to be even more powerful.
Unfortunately, these plans for the future are imperilled by a variety of non-
scaling physical effects, including the thermodynamically controlled
subthreshold behavior, quantum tunnelling of carriers through the gate
insulator and through the body-to-drain junction, and discrete doping effects.
Although several of these effects may have the potential to halt the scaling of
CMOS by making circuits non-functional, this is not the primary concern.
Rather, the most important limit is the power dissipated in the various
leakage mechanisms. One can try to reduce this leakage dissipation by
various circuit techniques, but in the end it leads to a whole range of
application-dependent limits to scaling, each optimized to the given
application’s own constraints on how much leakage dissipation is tolerable.
The range of limits spans at least a factor of three in minimum FET
dimensions and gate insulator thickness, creating the need for a wide range
of technology at the end of scaling so that circuit designers can choose the
most appropriate devices for a given application. Since the precise positions
of these optimized scaling limits depend on material and device structure
properties, it may be possible to progress to better limits by changing the
device materials or structures. Some of these options have been
summarized, including double-gate MOSFET structures, which are the focus
of much research. The optimized scaling limits for DG-FETs have been
compared to those for bulk, and it presently appears that DG-FETs will hold
an advantage in the end for many applications, but the size of this advantage
depends on details of band-to-band tunnelling and discrete doping effects
that have yet to be thoroughly explored.

ACKNOWLEDGEMENT

This work has benefited greatly from many useful discussions with co-workers and colleagues, including Bob Dennard, Wilfried Haensch, Ken
Rim, Ed Nowak, Paul Solomon, Yuan Taur, and H.-S. Philip Wong.

REFERENCES
[1] J. E. Lilienfeld. Method and apparatus for controlling electric currents. U.S. Patent
1745175, 1930.
[2] D. Kahng and M. M. Atalla, “Silicon–silicon dioxide field induced surface devices,”
Presented at IRE Solid-State Device Res. Conf., Pittsburgh, PA, June 1960.
[3] P. K. Bondy, “Moore’s law governs the silicon Revolution,” Proc. IEEE, 86, pp. 78-81,
Jan. 1998.
[4] Semiconductor Industry Association (SIA). International Technology Roadmap for
Semiconductors, 2001 Edition. Austin, Texas: SEMATECH, USA., 2706 Montopolis
Drive, Austin, Texas 78741, USA (http://public.itrs.net), 2001.
[5] Y. Naveh and K. K. Likharev, “Modeling of 10-nm-scale ballistic MOSFETs,” IEEE
Elec. Dev. Lett., 21, pp. 242-244, 2000.
[6] J. Frank, R. H. Dennard, E. Nowak, P. M. Solomon, Y. Taur and H.-S. P. Wong,
“Device scaling Limits of Si MOSFETs and their application dependencies,” in Proc.
IEEE, 89, pp. 259-288, 2001.
[7] Y. Taur, D. Buchanan, W. Chen, D. Frank, K. Ismail, S.-H. Lo, G. Sai-Halasz, R.
Viswanathan, H.-J. C. Wann, S. Wind and H.-S. Wong, “CMOS scaling into the
nanometer regime,” in Proc. IEEE, 85, pp. 486–504, April 1997.
[8] H.-S. P. Wong, D. J. Frank, P. M. Solomon, H.-J. Wann and J. Welser, “Nanoscale
CMOS,” in Proc. IEEE, 87, pp. 537-570, 1999.
[9] Sai-Halasz, “Performance trends in high-end processors,” in Proc. IEEE, 83, pp. 20, Jan.
1995.
[10] J. W. Sleight, P. R. Varekamp, N. Lustig, J. Adkisson, A. Allen, O. Bula, X, Chen, T.
Chou, W. Chu, J. Fitzsimmons, A. Gabor, S. Gates, P. Jamison, M. Khare, L. Lai, J. Lee,
S. Narasimha, J. Ellis-Monaghan, K. Peterson, S. Rauch, S. Shukla, P. Smeys, T.-C. Su,
J. Quinlan, A. Vayshenker, B. Ward, S. Womack, E. Barth, G. Blery, C. Davis, R.
Ferguson, R. Goldblatt, E. Leobandung, J. Welser, I. Yang and P. Agnello, “A high
performance SOI CMOS technology with a 70 nm silicon film and with a
second generation low-k Cu BEOL,” In IEDM Tech. Dig., pp. 245-248, 2001.
[11] Auberton-Hervé, “SOI: materials to systems,” In 1996 IEDM Tech. Dig., pp. 3, 1996.
[12] R. Puri and C. T. Chuang, “SOI digital circuits: design issues,” In Thirteenth Int. Conf.
VLSI Design, 2000., pp. 474 -479, 2000.
[13] Stathis, J.H, “Physical and predictive models of ultrathin oxide reliability in CMOS
devices and circuits,” IEEE Trans. Device and Materials Reliability, 1(1), pp. 43-59,
March 2001.
[14] R.H. Dennard, F.H. Gaensslen, H.N. Yu, V.L. Rideout, E. Bassous and A.R. LeBlanc,
“Design of Ion-implanted MOSFETs with very small physical dimensions,” Jour. Solid
St. Circuits, SC-9, pp. 256-268, 1974.
[15] Davari, R. H. Dennard and G. G. Shahidi, “CMOS scaling, the next ten years,” In Proc.
IEEE, 89, pp. 595-606, 1995.
[16] D. J. Frank, Y. Taur and H.-S. P. Wong, “Generalized scale length for two-dimensional
effects in MOSFET's,” IEEE Elec. Dev. Lett., 19, pp. 385-387,1998.
[17] J. Frank, “Power-constrained CMOS scaling limits,” IBM J. Res. Devel., 46(2/3),
March/May 2002.
[18] P. M. Solomon and I. J. Djomehri, “Overscaling, design for the future,” IBM Research
Report, RC22379, Jan. 2002.
[19] A. Asenov, S. Kaya and J. H. Davies, “Intrinsic threshold voltage fluctuations in decanano
MOSFETs due to local oxide thickness variations,” IEEE Trans. Electron Devices,
49(1), pp. 112 -119, Jan. 2002.
[20] W. Athas, N. Tzartzanis, W. Mao, L. Peterson, R. Lal, K. Chong, Joong-Seok Moon, L.
Svensson and M. Bolotski, “The design and implementation of a low-power clock-
powered microprocessor,” IEEE J. Solid-State Circuits, 35(11), pp. 1561 -1570, Nov.
2000.
[21] D. J. Frank, “Comparison of high speed voltage-scaled conventional and adiabatic
circuits,” In 1996 Int. Symp. Low Power Electronics and Design (ISLPED), Digest of
Tech. Papers, pp. 377, 1996.
[22] S. M. Sze. Physics of Semiconductor Devices, 2nd Edition. John Wiley & Sons, 1981.
[23] S.-H. Lo, D.A. Buchanan, Y. Taur and W. Wang, “Quantum-mechanical modeling of
electron tunneling current from the inversion layer of ultra-thin-oxide nMOSFET's,”
IEEE Electron Dev. Lett., 18, pp. 209, 1997.
[24] J. Robertson, “Band offsets of wide-band-gap oxides and implications for future
electronic devices,” J. Vacuum Science and Technology B, 18(3), pp. 1785-1791,2000.
[25] M. Fischetti, D. Neumayer and E. Cartier, “Effective electron mobility in Si inversion
layers in MOS systems with a high-k insulator: the role of remote phonon scattering," J.
Appl. Phys., 90(9), pp. 4587, 2001.
[26] D. Barlage, R. Arghavani, G. Dewey, M. Doczy, B. Doyle, J. Kavalieros, A. Murthy, B.
Roberds, P. Stokley and R. Chau, “High-frequency response of 100nm integrated CMOS
transistors with high-k gate dielectrics,” In IEDM Tech. Dig., pp. 231-234, 2001.
[27] H.-S. P. Wong, “Beyond the Conventional Transistor,” IBM J. Res. Devel., 46(2/3),
March/May 2002.
[28] Y. Taur, C. H. Wann and D. J. Frank, “25 nm CMOS Design Considerations,” In IEDM
Tech. Dig., pp. 789-792, 1998.
[29] H. Kawaura, T. Sakamoto and T. Baba, “Direct source-drain tunneling current in
subthreshold region of sub-10-nm gate EJ-MOSFETs,” In 1999 Si Nanoelectronics
Workshop Abstracts, pp. 26-27, 1999.
[30] A. Asenov and S. Saini, “Random dopant fluctuation resistant decanano MOSFET
architectures,” In 1999 Si Nanoelectronics Workshop Abstracts, pp. 84-85, June 1999.
[31] D. J. Frank, Y. Taur, M. leong and H.-S. P. Wong, “Monte Carlo modeling of threshold
variation due to dopant fluctuations,” In Symp. VLSI Technol., pp. 169-170, 1999.
[32] H.-S. P. Wong, Y. Taur and D. Frank, “discrete random dopant distribution effects in
nanometer-scale MOSFETs,” Microelectronic Reliability, 38, pp. 1447-1456, 1998.
[33] H.-S. P. Wong and Y. Taur, “Three-Dimensional `atomistic' simulation of discrete
microscopic random dopant distributions effects in MOSFETs,” In IEDM
Tech. Dig., pp. 705-708, 1993.
[34] E. Buturla, J. Johnson, S. Furkay and P. Cottrell, “A new 3-D device simulation
formulation. In NASCODE VI: Sixth International Conf. on the Numerical Analysis of
Semiconductor Devices and Integrated Circuits,” Boole Press, Dublin, pp. 291, 1989.

[35] D. J. Frank and H.-S. P. Wong, “Simulation of stochastic doping effects in Si MOSFETs,” In Proc. Int. Workshop on Computational Electronics, pp. 2-3, May 2000.
[36] D. J. Frank, S. E. Laux and M. V. Fischetti, “Monte Carlo simulation of a 30 nm dual-
gate MOSFET: How far can Si go?,” In IEDM Tech. Dig., pp. 553,1992.
[37] R. M. Swanson and J. D. Meindl, “Ion-implanted complementary MOS transistors in
low-voltage circuits,” IEEE J. Solid-State Circuits, SC-7, pp. 146–153, April 1972.
[38] J. R. Brews, “Physics of the MOS transistor,” In Applied Solid State Science. New York:
Academic, pp. 1–120, 1981.
[39] Z. Chen, J. Burr, J. Shott and J. D. Plummer, “Optimization of quarter micron MOSFETs
for low voltage/low power applications,” In IEDM Tech. Dig., pp. 63-65, 1995.
[40] D. J. Frank, P. Solomon, S. Reynolds and J. Shin, “Supply and threshold voltage
optimization for low power design,” In Proc. 1997 Int. Symp. Low Power Electronics
and Design, pp. 317-322, 1997.
[41] D. Liu and C. Svensson, “Trading speed for low power by choice of supply and threshold voltages,” IEEE J. Solid-State Circ., 28, pp. 10, 1993.
[42] V. R. von Kaenel, M. D. Pardoen, E. Dijkstra and E. A. Vittoz, “Automatic adjustment
of threshold and supply voltages for minimum power consumption in CMOS digital
circuits,” In IEEE Symp. Low Power Electronics, Digest of Technical Papers, pp. 78 -
79, 1994.
[43] S. Narendra, M. Haycock, R. Mooney, V. Govindarajulu, V. Erraguntala, H. Wilson, S.
Vangal, A. Pangal, E. Seligman, R. Nair, N. Borkar, J. Hofsheier, S. Menon, B.
Bloechel, G. Dermer, S. Borkar and V. De, “1.1V 1 GHz communications router with on-
chip body bias in 150nm CMOS,” In Proc. 2002 ISSCC, paper 16.4, 2002.
[44] S. Narendra, D. Antoniadis and V. De, “Impact of using adaptive body bias to
compensate die-to-die Vt variation on within-die Vt variation,” In Int. Symp. Low Power
Electronics and Design, pp. 229 -232, 1999.
[45] S. V. Kosonocky, M. Immediato, P. Cottrell, T. Hook, R. Mann, J. Brown, “Enhanced
multi-threshold (MTCMOS) circuits using variable well bias,” In Int. Symp. Low Power
Electronics and Design. pp. 165-169, 2001.
[46] M. Norishima, H. Yoshinari, H. Hayashida, T. Eguchi, K. Kasai, H. Shinagawa, T.
Matsunaga, T. Matsuno, H. Shibata, Y. Toyoshima, K. Hashimoto, “High-performance
CMOS technology for logic LSIs with embedded large capacity SRAMs,” In
IEDM Tech. Dig., pp. 489-492, 1991.
[47] K. Rim, private communication.
[48] K. Rim, S. Koester, M. Hargrove, J. Chu, P. M. Mooney, J. Ott, T. Kanarsky, P.
Ronsheim, M. Ieong, A. Grill and H.-S. P. Wong, “Strained Si NMOSFETs for high
performance CMOS technology,” In Symp. VLSI Tech., pp. 59, 2001.
[49] J.-P. Colinge, “Thin-film SOI technology: the solution to many submicron CMOS
problems,” In IEDM Tech Dig., pp. 817-820, 1989.
[50] L. T. Su, H. Hu, J. B. Jacobs, M. Sherony, A. Wei and D. A. Antoniadis, “Tradeoffs of
current drive vs. short-channel effect in deep-submicrometer bulk and SOI MOSFET's,”
In IEDM Tech. Dig., pp. 649, Dec. 1994.
[51] C. Wann, F. Assaderaghi, L. Shi, K. Chan, S. Cohen, H. Hovel, K. Jenkins, Y. Lee, D. Sadana, R. Viswanathan, S. Wind and Y. Taur, “High-performance CMOS with 9.5-ps gate delay and 150 GHz fT,” IEEE Electron Dev. Lett., 18, pp. 625, 1997.
[52] H.-S. P. Wong, D. J. Frank and P. M. Solomon, “Device Design Considerations for
Double-gate, Ground-plane, and Single-gated Ultra-thin SOI MOSFETs at the 25 nm
Channel Length Generation,” In IEDM Tech. Dig., pp. 407-410, 1998.
[53] R. Chau, J. Kavalieros, B. Doyle, A. Murthy, N. Paulsen, D. Lionberger, D. Barlage, R. Arghavani, B. Roberds and M. Doczy, “A 50nm depleted-substrate CMOS transistor (DST),” In IEDM Tech. Dig., pp. 621-624, 2001.
[54] Yang, C. Vieri, A. Chandrakasan and D. Antoniadis, “Back-gated CMOS on SOIAS for
dynamic threshold voltage control,” IEEE Trans. Elec. Dev., 44, pp. 822, 1997.
[55] Fiegna, H. Iwai, T. Wada, T. Saito, E. Sangiorgi and B. Ricco, “A new scaling
methodology for the MOSFET,” In Symp. VLSI Technology, pp. 33,
1992.
[56] T. Sekigawa and Y. Hayashi, “Calculated threshold-voltage characteristics of an XMOS
transistor having an additional bottom gate,” Solid State Electronics, 27, pp. 827, 1984.
[57] G. Pikus and K. K. Likharev, “Nanoscale field-effect transistors: An ultimate size
analysis,” Appl. Phys. Lett., 71(25), pp. 3661-3663, Dec. 1997.
[58] T. Tanaka, K. Suzuki, H. Horie and T. Sugii, “Ultrafast low-power operation of
double-gate SOI MOSFETs,” In Symp. VLSI Technology, pp. 11, June 1994.
[59] H.-S. Wong, K. Chan and Y. Taur, “Self-aligned (top and bottom) double-gate MOSFET
with a 25 nm thick silicon channel,” In IEDM Tech. Dig., pp. 427-430,1997.
[60] J. M. Hergenrother, G. D. Wilk, T. Nigam, F. P. Klemens, D. Monroe, P. J. Silverman,
T. W. Sorsch, B. Busch, M. L. Green, M. R. Baker, T. Boone, M. K. Bude, N. A.
Ciampa, E. J. Ferry, A. T. Fiory, S. J. Hillenius, D. C. Jacobson, R. W. Johnson, P.
Kalavade, R. C. Keller, C. A. King, A. Komblit, H. W. Krautter, J. T.-C. Lee, W. M.
Mansfield, J. F. Miner, M. D. Morris, S.-H. Oh, J. M. Rosamilia, B. J. Sapjeta, K. Short,
K. Steiner, D. A. Muller, P. M. Voyles, J. L. Grazul, E. J. Shero, M. E. Givens, C.
Pomarede, M. Mazanec and C. Werkhoven, “50 nm vertical replacement-gate (VRG)
nMOSFETs with ALD HfO2 and Al2O3 gate dielectrics,” In IEDM Tech. Dig., pp. 51-
54, Dec. 2001.
[61] J. M. Hergenrother, D. Monroe, F. P. Klemens, A. Komblit, G. R. Weber, W. M.
Mansfield, M. R. Baker, F. H. Baumann, K. J. Bolan, J. E. Bower, N. A. Ciampa, R. A.
Cirelli, J. I. Colonell, D. J. Eaglesham, J. Frackoviak, H. J. Gossmann, M. L. Green, S.
Hillenius, C. King, R. Kleiman, W. Y. C. Lai, J. T.-C. Lee, R.-C. Liu, H. Maynard, M.
Moris, S.-H. Oh, C.-S. Pai, C. Rafferty, J. Rosamilia, T. Sorsch and H.-H. Vuong, “The
Vertical Replacement-Gate (VRG) MOSFET: a 50-nm vertical MOSFET with
lithography-independent gate length,” In IEDM Tech. Dig., pp. 75-78, Dec. 1999.
[62] Y.-K. Choi, N. Lindert, P. Xuan, S. Tang, D. Ha, E. Anderson, T.-J. King, J. Boker and
C. Hu, “Sub-20nm CMOS FinFET technologies,” In IEDM Tech. Dig., pp. 421, 2001.
[63] X. Huang, W.-C. Lee, C. Ku, D. Hisamoto, L. Chang, J. Kedzierski, E. Anderson, H. Takeuchi, Y.-K. Choi, K. Asano, V. Subramanian, T.-J. King, J. Bokor and C. Hu, “Sub 50-nm FinFET: PMOS,” In IEDM Tech. Dig., pp. 67-70, Dec. 1999.
[64] J. Kedzierski, D. M. Fried, E. J. Nowak, T. Kanarsky, J. H. Rankin, H. Hanafi, W.
Natzle, D. Boyd, Y. Zhang, R. A. Roy, J. Newbury, C. Yu, Q. Yang, P. Saunders, C. P.
Willets, A. Johnson, S. P. Cole, H. E. Young, N. Carpenter, D. Rakowski, B. A. Rainey,
P. E. Cottrell, M. Ieong and H.-S. P. Wong, “High-performance symmetric-gate and
CMOS-compatible VT asymmetric-gate FinFET devices,” In IEDM Tech. Dig., pp. 437-
440,2001.
[65] N. Lindert, L. Chang, Y.-K. Choi, E. Anderson, W.-C. Lee, T.-J. King, J. Bokor and C.
Hu, “Sub-60-nm quasi-planar FinFETs fabricated using a simplified process,” IEEE
Electron Dev. Lett., 22(10), pp. 487-489, 2001.
[66] D. Rabe and W. Nebel, “Short circuit power consumption of glitches,” In Int. Symp.
Low Power Electronics and Design, Dig. Tech. Papers, pp. 125-128,1996.

[67] J. Tschanz, J. Kao, S. Narendra, R. Nair, D. Antoniadis, A. Chandrakasan and V. De, “Adaptive body-bias for reducing impacts of die-to-die and within-die parameter
variations on microprocessor frequency and leakage,” In 2002 ISSCC, paper 25.7, 2002.
[68] K. W. Guarini, P. M. Solomon, Y. Zhang, K. K. Chan, E. C. Jones, G. M. Cohen, A.
Krasnoperova, M. Ronay, O. Dokumaci, J. J. Bucchignano, C. C. Jr., C. Lavoie, V. Ku,
D. C. Boyd, K. S. Petrarca, I. V. Babich, J. Treichler, P. M. Kozlowski, J. S. Newbury,
C. P. D’Emic, R. M. Sicina and H.-S. Wong, “Triple-self-aligned, planar double-gate
MOSFETs: Devices and circuits,” In IEDM Tech. Dig., pp. 425-428, 2001.
[69] J.-H. Lee, G. Tarashi, A. Wei, T. Langdo, E. A. Fitzgerald and D. Antoniadis, “Super
self-aligned double-gate (SSDG) MOSFETs utilizing oxidation rate difference and
selective epitaxy,” In IEDM Tech. Dig., pp. 71-74, Dec. 1999.
Chapter 3
Low Power Memory Design

Yukihito Oowaki and Tohru Tanzawa


SoC R&D Center, Toshiba Corp.

Abstract: This chapter describes techniques and issues for power aware design of
memories. The focus is on non-volatile flash memories, non-volatile
ferroelectric memories and embedded DRAMs, which are becoming
increasingly important in the Wireless/Internet era.

Key words: Semiconductor memory, low power memory design, DRAM, SRAM, flash
memory, FeRAM.

3.1 INTRODUCTION

Semiconductor memories are classified by several characteristics: storage mode (volatile or non-volatile), read/write endurance, operation speed, cell size, capacity, and operation voltage.
Figure 3.1 compares the features of major memories projected for 2003-
2005. DRAM and SRAM are volatile random access memories. NOR type
and NAND type flash memories are non-volatile floating-gate cell
memories. Floating-gate cells require a much longer write time than read access time; therefore, they are also called electrically programmable read-only memories. FeRAM and Magnetic RAM are non-volatile, and their write time
is as fast as their read access time. They are non-volatile random access
memories. The various memories are used in different applications based on
their characteristics and properties.
Low power LSI memory technology is becoming increasingly important.
It is commonly accepted that the density of semiconductor memories has increased four-fold about every three years, following the linear scaling rule of Moore’s Law. Personal computers have dominated the high technology
market, and therefore low-power technologies for stand-alone DRAMs have
been quite important and thus intensively studied [1][2]. They are, of course,
increasingly important since internet/wireless electronic products have
demonstrated higher growth than personal computers recently [3]. Memory
technology suitable for those new applications is required. As internet/wireless electronics, especially mobile electronic apparatuses, become a major application, lowering both active and standby power while meeting the ever-increasing performance/capacity requirements is a critical
issue. Non-volatility of memory is one of the most desirable features of this
class of electronics because standby power can theoretically be as small as
zero. An embedded memory solution that can reduce the active power
consumption by eliminating the external I/O bus power consumption is also
desirable for mobile apparatuses from the view point of power and
packaging density. This chapter describes low-power memory design,
focusing on those for non-volatile memories and embedded DRAMs.

3.2 FLASH MEMORIES

Since the flash concept was first reported by Masuoka et al. [4] at the 1984
IEDM, several different cell variations and many circuit techniques have
been developed and commercialized, and in turn, applications for flash
memories have proliferated greatly. The flash cells, e.g., NOR [5], NAND
[6], DINOR [7], AND [8], SanDisk cell [9], and SST cell [10], are divided
into two groups; one aims at fast random access, and the other aims at high-
bit density. A NOR flash cell is a cell typical of the former group. Figure
3.3(a) illustrates a cross-sectional view of a stacked gate flash memory cell.
The memory cell has no serially-connected transistors and a relatively large
cell current of 50-100uA for its high speed sensing operation. Early
applications of the NOR flash were for program code storage in PC BIOS,
disk drives, automotive engines, and so on, which require a random access
time of less than 100ns. Recently, the NOR flash has also been used in
handheld digital equipment, e.g., cellular phones, messaging pagers, and
flash-embedded logic devices. In addition to a fast access time, the NOR
flash memories are required to have low-power consumption for longer
battery life and low-voltage operation in accordance with the lowering of the
minimum supply voltage of the other logic and analog devices mounted on
the same board or merged on the same chip. Figure 3.1 shows the trend of
the supply voltage of flash memories. Originally, NOR-based flash devices
had two supply voltages: 5V for read and 12V for program and erase [5]. In
1989, the first 5V-only NOR flash memory was introduced [11], and the
2.7V-only NOR flash was reported seven years later [12]. Currently, 1.8V-
only NOR flash memories [13] are used for 1.8V systems.
On the other hand, a NAND flash cell, as shown in Figure 3.6(a), is a
typical cell aimed at high-bit density. The NAND string has two select
transistors, one source line, and one bit-line contact for eight to thirty-two
series-connected flash cells [6]. Thus, the NAND flash memory has the
smallest cell size, and so the bit-cost is the lowest among flash memories.
Although the target of the NAND flash memory was originally replacement
of magnetic hard disks and floppy disks [6], recently other applications have
expanded, and NAND flash memories are used in digital cameras and solid-
state audio players. Using data compression techniques such as MPEG-1
Audio Layer 3 (MP3), flash memory cards supporting 64MB capacity can
store 1-2 hours of CD-quality music. Digital cameras and silicon audio
players utilize flash memory cards such as SmartMedia™ on which only one
or two NAND flash memory chips are mounted, and SD Card™ on which a
flash controller and some flash memories are mounted. In accordance with
the lowering of the supply voltage of flash controller devices to sub-2V, the
NAND flash memory and the NOR flash memory are required to operate at
the same supply voltage as that of the controller devices for simplicity in the
low-power systems.
This section reviews several control schemes and circuits peculiar to flash memories, such as charge pump circuits, level shifters, and sense amplifiers, for low-power NOR and NAND flash memories, as shown in Figure 3.2.

3.2.1 Flash Memory Cell Operation and Control Schemes

3.2.1.1 NOR Flash Memory

Basic memory cell operation


Figure 3.3(d) shows a typical bias condition of NOR flash memory cells. In
program mode, the word-line and bit-line of the cell selected for 0-
programming are set to about 10V and 5V, respectively, whereas the other
word-lines and bit-lines are grounded. Hot electrons are generated at the
drain edge of the selected cell and injected into the floating gate. As a result,
the threshold voltage of the 0-programmed cell is higher than that of the cell
in the erased state as illustrated in Figure 3.3(c). In erase mode, all of the
cells that share the source-line or the p-well are selected at the same time.
All the word-lines are set to about -7V, and the source-line for source
erasure [11] or the p-well for channel erasure [13] is set to about 7-10V.
Fowler-Nordheim tunneling current flows and electrons stored in the
floating gates are ejected to the source-line or p-well. Thus, the threshold
voltages of the erased cells are lowered. The sense amp senses the memory
cell data by means of the cell current. When the cell current flows through
the accessed memory cell, the sense amp outputs the data 1. On the other
hand, when the NOR cell turns off at the word-line voltage of about 5V, the
sense amp outputs the data 0. The program and erase voltages can hardly be scaled with the supply voltage, because the reliability of flash memory cells strongly depends on the tunnel oxide and inter-poly dielectric thicknesses, which limits how thin they can be made [15].
discussed in the charge pump section, the power efficiency for charge pump
circuits, which generate voltages higher than the supply voltage on chip for
reading, programming, and erasing data stored in the flash memories,
decreases in accordance with the lowering of the supply voltage under the
condition of the constant cell operation voltages. Therefore, it is very
important to reduce the operation voltages and load currents from the
viewpoint of power. Several low-power design techniques for NOR flash
memories are reviewed below.

Bit-by-bit weak program


The word-line voltage for read operation is determined by the trade-off
between large cell current for fast-read operation and high reliability against
read disturb [14]. The cell current is fixed by the gate overdrive. In order to
reduce the word-line voltage for read operation with sufficient cell current,
the upper limit of the erased threshold voltage distribution must be kept low even when that distribution is wide. Therefore, bit-by-bit
weak program operation was introduced to low-voltage NOR flash
memories, as shown in Figure 3.4 [13].
After the erase-verify operation passes, the over-erase verify operation
starts. When a cell in the over-erased state is detected, the cell is weakly programmed with a word-line voltage lower than the normal programming voltage. After these operations are repeated for all of the
memory cells in the block, tight threshold voltage distribution is achieved.
As a result, the word-line voltage for read operation can be reduced and
thereby the power consumption for generating the word-line voltage can be
also reduced. The decrease in the upper limit of the erased threshold voltage
distribution leads to the decrease in the lower limit of the programmed
threshold voltage distribution under the condition of the constant signal
margin for read operation. This also results in reduction in the word-line
voltage for program operation and thereby the power consumption for
generating the program word-line voltage can be also reduced.
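As a rough illustration, the verify-and-weak-program loop can be sketched in software. The threshold limit, pulse step, and function names below are assumed example values for illustration, not parameters taken from [13].

```python
# Illustrative sketch of bit-by-bit over-erase verify with weak programming.
# Thresholds and voltages are assumed example values, not from the cited design.

OVER_ERASE_LIMIT = 0.5   # lower bound of the allowed erased Vt (V), assumed
WEAK_PROGRAM_STEP = 0.3  # Vt shift per weak program pulse (V), assumed

def tighten_erased_distribution(cell_vts, limit=OVER_ERASE_LIMIT,
                                step=WEAK_PROGRAM_STEP, max_pulses=10):
    """Raise every over-erased cell's Vt above the limit.

    Models the bit-by-bit weak program: each cell is verified and, if its
    Vt is below the limit, weakly programmed with a reduced word-line
    voltage until it passes (or the pulse budget runs out).
    """
    result = []
    for vt in cell_vts:
        pulses = 0
        while vt < limit and pulses < max_pulses:
            vt += step          # one weak program pulse
            pulses += 1
        result.append(vt)
    return result

cells = [-0.4, 0.2, 0.8, 1.1]          # erased Vt distribution (V)
tight = tighten_erased_distribution(cells)
assert all(vt >= OVER_ERASE_LIMIT for vt in tight)
```

The tightened lower edge of the distribution is what permits the lower read and program word-line voltages described above.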

Negative-gate source- and channel-erase


Originally, an external high rewriting voltage of 12V was applied directly to the source terminals of the memory cells, with the control gates grounded, in source-erase operation [5].
negative-gate source-erase scheme was introduced for reducing the erase
voltages (by 3-5V) and load currents due to band-to-band tunneling [11].
Negative-gate channel-erase technology was developed for NOR flash
memories from a 0.25um technology node with a supply voltage of 1.8V in
order to scale the cell size and drastically reduce the load currents [13].
Divided programming scheme
In 5V power supply generations, the bit-line voltage for programming is
directly supplied from the power supply. A maximum of sixteen cells are
programmed simultaneously in a selected word. From 3V-only flash
memory generation, a program bit-line charge pump circuit was introduced
to supply a program voltage of about 5V with a sufficient drain current,
about 0.3-0.5mA per cell. In order to suppress the charge pump circuit area,
the number of simultaneously programmed cells is decreased to a maximum
of four bits, for example. However, this results in a long programming time.
A divided program scheme detects the number of bits to be programmed and
optimizes the number of sequential program pulses, resulting in a reduction
in the average programming time [12]. By using this technique, peak current
can be also drastically reduced.
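The pulse-count arithmetic of the divided program scheme can be sketched as follows; the four-cell pump capability and the 16-bit word width are assumed example numbers, not figures from [12].

```python
import math

# Sketch of the divided program scheme: count the bits that actually need
# programming and issue only as many sequential pulses as the bit-line pump
# can support. Pump capability (4 cells/pulse) is an assumed example value.

MAX_PARALLEL = 4  # cells the bit-line charge pump can drive per pulse (assumed)

def program_pulses(word_bits, max_parallel=MAX_PARALLEL):
    """Return the number of sequential program pulses for one word.

    Only '0' bits are programmed (a NOR cell erases to '1'); the pulse
    count follows the number of bits to program, not the word width.
    """
    to_program = sum(1 for b in word_bits if b == 0)
    return math.ceil(to_program / max_parallel)

assert program_pulses([1] * 16) == 0                 # nothing to program
assert program_pulses([0] * 16) == 4                 # worst case: 16/4 pulses
assert program_pulses([0, 0, 0] + [1] * 13) == 1     # sparse data: one pulse
```

Because typical data words contain far fewer than sixteen 0-bits, the average pulse count, and hence both the programming time and the peak pump current, drops well below the worst case.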
Substrate-current-induced CHE programming with ramped gate
program pulse
The drain current of the memory cell in program operation directly affects
the power in program operation. Therefore, the program efficiency, which is
defined by a ratio of the gate current to the drain current, is an important
parameter in NOR flash memory cells. For this reason, many studies for high
program efficiency have been performed concerning drain engineering and
memory cell operation conditions. In particular, the substrate-current-
induced channel hot-electron (CHE) programming technique was found to
be effective for high-program efficiency and thereby low power [16]. This
technique requires negative substrate bias generation, which prolongs the programming time. Page programming is suitable for this
technique in order to make the overhead relatively small. To further improve
the efficiency, ramped gate programming employed with substrate-current-
induced channel hot-electron programming technique was proposed to
maximize the program throughput under the condition of a given output
current of the drain pump [17].
No-boosted read operation for split-gate type NOR flash memories
Because the word-line voltage level for read drastically affects the power,
the bit-by-bit weak program method was introduced for reducing the word-
line voltage level for read in NOR flash memories. For further reduction, cell
transistors need to be changed to split-gate [18] cell transistors, or two [19]
or three [20] transistor memory cells, as illustrated in Figure 3.5. Even
though the erased threshold voltage of the floating-gate transistor is negative,
the threshold voltage of the memory cell is determined by that of the select
transistor. Therefore, the word-line voltage for read can be decreased down
to the supply voltage, resulting in drastic reduction in power. A disadvantage
of split-gate NOR flash memories when compared to standard NOR flash
memories is the large cell size, 1.5-2.5X of the standard NOR flash memory
cell. Thus, split-gate cells are utilized as embedded non-volatile memories
with small microcontroller capacity [19][20].
Source-side injection
Another advantage of split-gate type NOR flash memories is high program
efficiency. The control gate voltage is controlled so as to weakly invert the
channel of the select transistor. Most of the hot-electrons generated at the
source side have momentum in the direction of the floating gate. Thus, split-
gate flash memories can have much higher program efficiency than standard
NOR flash memories [18].

3.2.1.2 NAND Flash Memory

Basic Memory Cell Operation


Figure 3.6(c) shows a typical bias condition of the NAND flash memory
cells. Fowler-Nordheim tunneling mechanisms can program and erase
NAND flash cells. Because the program efficiency for NAND flash cells is
approximately one, the number of simultaneously programmed bits can be
increased to a 512-2k byte page [22]. Figure 3.6(b) shows the threshold
voltage distribution of the NAND flash memory. Unselected cell transistors
in read operation have to turn on. Thus, the programmed threshold voltage
distribution is tightened by a bit-by-bit program verify operation [21].

Self-boosted Programming Technique


Originally, a program-inhibit bit-line voltage of about 8V was generated on chip and transferred to the channel of the unselected cell transistor, preventing programming, as illustrated in Figure
3.7(a) [22]. However, this operation dissipates much power. In order to
dispense with the requirement for the bit-line pump, a self-boosted
programming technique was introduced for low power NAND flash
memories [23]. In this technique, the bit-line voltage for program-inhibit is
the supply voltage. The channel of the unselected cell transistor is initially
precharged to Vdd-Vt, where Vdd is the supply voltage, and Vt is the
threshold voltage of the select transistor. The selected word-line voltage then
rises up to about 20V so that the self-boosted channel voltage rises up to
about 8V because the select transistor is cut off (Figure 3.7 (b)). A
disadvantage of this method compared with the fixed bit-line bias method is
that the number of program operations repeated to the same word-line is
limited to less than ten [24]. In low-voltage NAND flash memories, these
two methods should be combined to realize both low voltage and a sufficient
number of repeated program operations. One of the solutions is the source-
programming technique reviewed in the next section.
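A back-of-the-envelope model shows how the inhibit level arises and why it degrades at low supply voltage; the coupling ratio and voltages below are assumed illustrative values, not measured ones.

```python
# Simple model of the self-boosted program-inhibit channel voltage:
# precharge to Vdd - Vt, then capacitive coupling from the word-line swing
# lifts the floating channel. The coupling ratio is an assumed example value.

def boosted_channel_voltage(vdd, vt_select, v_wl, coupling_ratio=0.3):
    """Channel voltage of an inhibited NAND cell after self-boosting."""
    precharge = vdd - vt_select        # select transistor cuts off here
    return precharge + coupling_ratio * v_wl

v_ch = boosted_channel_voltage(vdd=2.5, vt_select=0.7, v_wl=20.0)
# in the ballpark of the ~8 V inhibit level quoted in the text
assert 7.0 < v_ch < 9.0

# The weakness of the scheme at low Vdd is visible directly: a lower
# supply lowers the precharge level and hence the boosted inhibit voltage.
assert boosted_channel_voltage(1.8, 0.7, 20.0) < v_ch
```

The supply-dependence of the precharge term is exactly what motivates the source-line programming technique of the next section.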

Source-line Programming Operation


The initial channel voltage of the program-inhibit cell in program operation
depends on the supply voltage. Thus, the self-boosted voltage decreases as
the supply voltage decreases, resulting in fewer repeated program operations.
In order to overcome the reduction in the number of program operations, the
source-line programming technique was developed, as illustrated in Figure
3.8 [25]. A boosted voltage Vbst of about 4.5V (as high as a read voltage) is
applied to the common source-line, and all the channels of the cells
connected with the selected word-line are precharged to about 4.5V, as
illustrated in Figure 3.8(a). The source select transistors are then cut off, and
the drain select transistors are turned on with a select gate voltage of 0.7V.
When the programming data is 0, the bit-line voltage is set to 0V. Thus, the
channel voltage is 0V. On the other hand, when the programming data is 1,
the bit-line voltage is set to 0.5V. The drain select transistor is in cut-off
condition so that the channel voltage remains at the precharged level. After
that, self-boosted programming operation starts, as shown in Figure 3.8(b).
The total load capacitance for the common source-line is much smaller than
that for the 512-2k bit-lines. As a result, this scheme realizes both low-
voltage low-power operation and a sufficient number of repeated program
operations.
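The energy argument can be made concrete with rough numbers; all capacitances and the page size below are assumed examples, not values from [25].

```python
# Energy comparison sketch: precharging program-inhibit channels through
# the single common source line versus pumping every bit-line of a page
# to the inhibit level. Capacitances are assumed round numbers.

def precharge_energy(capacitance, v_boost):
    return capacitance * v_boost ** 2   # ~ C*V^2 drawn from the booster

C_SOURCE_LINE = 5e-12     # common source line capacitance (F), assumed
C_BITLINE = 2e-12         # one bit-line capacitance (F), assumed
N_BITLINES = 512 * 8      # one 512-byte page, assumed

e_source = precharge_energy(C_SOURCE_LINE, 4.5)
e_bitlines = N_BITLINES * precharge_energy(C_BITLINE, 8.0)
assert e_source < e_bitlines / 100
```

Charging one shared line instead of thousands of bit-lines is why the scheme combines low-power operation with a restored program-repeat margin.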

Shielded Bit-line Read Operation


In accordance with the advance of lithographic technology, the capacitance between neighbouring bit-lines separated by the minimum space becomes the dominant factor in the total bit-line capacitance. Read operation, as well as program operation, is performed in units of a page. When all the bit-lines
are simultaneously selected for reading, the bit-lines are precharged up to a
high voltage level for sufficient noise margin even though the voltage of the
0-data bit-line decreases due to coupling noise (Figure 3.9(a)). This results
in high power in the read operation even for low-voltage NAND flash
memories.
The shielded bit-line scheme [26] was proposed for a high-speed low-power
read operation (Figure 3.9(b)). In this scheme, bit-lines with even and odd
numbers are assigned to different pages. When the even page is selected for
read, bit-lines with odd numbers are grounded with RSTo high. The noise
from the other bit-lines can be reduced by a factor of twenty. Therefore, the
precharged voltage level can be decreased by a factor of two so that the
power can be also reduced to about half.
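A simple capacitive-divider model illustrates the noise reduction; the capacitance split, swing, and residual-coupling factor are assumed round numbers for illustration only.

```python
# Sketch of inter-bit-line coupling noise and the effect of grounded
# shield lines. All capacitance values are assumed round numbers.

def coupling_noise(c_neighbor, c_total, v_swing_neighbor):
    """Voltage disturb on a floating bit-line from a switching neighbor."""
    return (c_neighbor / c_total) * v_swing_neighbor

C_TOTAL = 2.0e-12      # total bit-line capacitance (F), assumed
C_NEIGHBOR = 0.8e-12   # coupling to the two adjacent lines (F), assumed
V_SWING = 1.5          # neighbor discharging by 1.5 V, assumed

noise_open = coupling_noise(C_NEIGHBOR, C_TOTAL, V_SWING)
# With odd bit-lines grounded, the direct aggressors no longer swing; only
# residual second-neighbor coupling remains (assumed ~1/20 of the direct one).
noise_shielded = coupling_noise(C_NEIGHBOR / 20, C_TOTAL, V_SWING)
assert noise_shielded < noise_open / 10
```

With the noise floor lowered by an order of magnitude, the precharge level, and the power roughly proportional to it, can be cut as the text describes.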

3.2.2 Circuits Used in Flash Memories


In this section, low-voltage circuit design for high-performance flash memories is reviewed.

3.2.2.1 Charge Pump Circuits

Optimization
Figure 3.10(a) shows the Dickson charge pump circuit [28]. The equivalent
circuit is illustrated in Figure 3.10(b) and captures its dynamic characteristics [29]. In order to design the charge pump circuit, it is necessary to have an
optimization theory that takes into consideration the dynamics of the circuit
to accelerate the rise time of the output voltage even at a low supply voltage
[29]. Another optimization is required to obtain the maximum output current
for a given circuit area [30]. The former optimization is applied to the
programming word-line charge pump and the erase voltage generating
charge pump for both NOR and NAND flash memories and the latter is
applied to the word-line pump for reading and the bit-line pump for
programming in NOR flash memories. Figure 3.10(c) summarizes the
optimized number of stages and the required capacitance per stage for a
required rise time for the former optimization and for a required output
current for the latter optimization.
Figure 3.11 shows the dependence of the rise time, current consumption,
and power on the boosted voltage and supply voltage for the charge pump
circuit [29]. The power is proportional to the number of stages, which, in
turn, is inversely proportional to Vdd-Vt, where Vt is the threshold voltage
of the transfer transistor. Therefore, the threshold voltage degrades the
power efficiency of the charge pump circuit for low-voltage flash memories.
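A first-order Dickson pump model makes this efficiency degradation explicit; the target voltage, threshold, clock frequency, and stage capacitance below are assumed example values rather than the optimized designs of [29][30].

```python
import math

# First-order Dickson pump model with transfer transistors of threshold vt.
# The voltage gain per stage is vdd - vt - iout/(f*c); the stage count
# needed for a given output grows rapidly as vdd approaches vt, which is
# the efficiency loss discussed in the text. All numbers are illustrative.

def stages_needed(v_target, vdd, vt, iout, f, c):
    gain_per_stage = vdd - vt - iout / (f * c)
    if gain_per_stage <= 0:
        raise ValueError("pump cannot reach the target at this supply")
    return math.ceil((v_target - vdd) / gain_per_stage)

def input_power(v_target, vdd, vt, iout, f, c):
    n = stages_needed(v_target, vdd, vt, iout, f, c)
    # each stage hands the load charge up the chain: ~ (n + 1) * iout from Vdd
    return vdd * (n + 1) * iout

P = dict(v_target=10.0, vt=0.6, iout=100e-6, f=20e6, c=10e-12)
p_3v = input_power(vdd=3.0, **P)
p_18 = input_power(vdd=1.8, **P)
assert p_18 > p_3v   # lower supply -> more stages -> more input power
```

With a fixed cell operating voltage, lowering Vdd from 3.0V to 1.8V in this model roughly doubles the stage count and increases the input power, motivating the reduced-threshold pump circuits reviewed next.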
Low voltage charge pump
In order to reduce the effective threshold voltage of the transfer gate, a four-
phase clock pump (Figure 3.12) [31] and floating-well pump [32] have been
developed. In the four-phase clock pump, gate overdrive can be realized so
that the transfer gate operates in linear region. In the floating-well pump, the
well of the transfer transistors is in floating state, so that the transfer
transistors do not suffer from the body effect. A disadvantage of these charge
pump circuits is the low clock frequency compared with that of the
conventional two-phase clock Dickson pump.

Pool method for read voltage generation and low standby-power system
For a 3V-only NOR flash memory, a capacitor-switched booster circuit or
kicker scheme was used to generate a read voltage higher than the supply
voltage by switching the connection state of one or more boosting capacitors
with the load capacitor from parallel to series, synchronized with the address
transition detection (ATD) signal, as illustrated in the first column of Figure
3.13 [33].

For sub-2V or lower supply voltage flash memories, the Dickson charge
pump circuit is better than the capacitor-switched booster circuit from the
viewpoint of active power [30]. The only disadvantage of the Dickson pump
compared with the capacitor-switched booster is the high standby current. In
order to reduce the standby current to a sufficiently low level, the detection current and the operation current in the detector have to be of the order of 100nA, even for a Dickson pump. Under such low current conditions, the
conventional negative feedback system cannot be used because the operation
speed in the detector is so slow that the word-line voltage cannot be stably
controlled. A low standby-power system was developed as illustrated in the
third column of Figure 3.13. The standby pump operates until the counter
counts a previously determined number so as not to overshoot the word-line
voltage, while drawing a very low standby current [35].

Area reduction scheme


The area required for charge pump circuits in a low-voltage flash memory increases drastically as the supply voltage is lowered, for a constant rise time [29]. In the
NAND flash case, four kinds of charge pump circuits are required: a 4.5V-
pump for unselected word-lines in read operation, a 10V-pump for
unselected word-lines in program operation, and two 18V-pumps for a P-
well erase operation and for selected word-line in program operation. In
conventional NAND flash memory, these charge pump circuits are
separated. As a result, the area efficiency of charge pump circuits has been
significantly low. In an area reduction scheme, three kinds of charge pumps are merged into one pump that operates with individually optimized efficiency for the different operational modes [34]. For simplicity, Figure 3.14
shows the charge pump circuit, which outputs 4.5V with two pumps in read
operation and 10V with a single pump in program operation. As a result, the
area reduction scheme reduces the area required for charge pump circuits in
NAND flash memories by 40% [34].

3.2.2.2 Level Shifter

Level shifters are required to transform a signal with an amplitude of the supply voltage into a signal with an amplitude of the read, program, or erase
voltage. CMOS high/low-level shifters for NOR flash memories and an
NMOS high-level shifter for NAND flash memories (Figure 3.15) are
reviewed for low-voltage low-power operation.

Low-voltage high/low-level shifters for NOR flash


In comparison with NAND flash memories, NOR flash memories are
required to have much higher read performance with an access time of less
than 100ns. This demand requires CMOS high-voltage transistors. As a
result, NOR flash requires CMOS high- and low-level shifters. Figure 3.16
illustrates the conventional and low-voltage high-level shifters [35]. As
shown in the figure, the conventional high-level shifter cannot operate at
1.5V or below because the supply voltage is close to the threshold voltage of
the enhancement NMOS transistor. In order to reduce the minimum
operating supply voltage, low-Vt transistors are utilized in the level shifter.
Leakage current does not flow in the low-Vt transistor that is to be cut off, because of its negative gate-to-source voltage. The low-voltage level shifter realizes
both the low-voltage operation and low power.
Figure 3.17 illustrates conventional and low-voltage CMOS low-level
shifters [36]. The conventional level shifter marginally operates at 2.5V
because of the small current flowing through PMOS transistors due to drastic
reduction in gate overdrive, whereas the low-voltage low-level shifter can
operate even at 1V. The low-voltage low-level shifter is composed of three
parts: the latch holding the negative erasing voltage, two coupling capacitors
connected with the latched nodes, and the drivers inverting the latch. The
drivers can have sufficient driving currents to invert the latch via the
coupling capacitors. Other types of level shifters were proposed in [37].

Low-voltage NMOS level shifter for NAND flash


Figure 3.18 illustrates a conventional NMOS high-level shifter composed of
only high-voltage NMOS transistors used in NAND flash memories. Unlike
CMOS switches with a large parasitic capacitance of N-well for PMOS
transistors, this NMOS high-level shifter has small gate-, junction- and
wiring-capacitance, resulting in low-power consumption and short rise time.
However, this switch also has a disadvantage in that the minimum operating
supply voltage is mainly limited by the threshold voltage of the enhancement
transistor, which prevents the leakage current from flowing in the level
shifter during the inactive state. A diode-connected intrinsic transistor
without channel implantation is used to improve the positive-feedback
efficiency of the local booster when selected for operation. However, the
conventional level shifter cannot operate at a supply voltage below 2.4V, as
shown in the figure. The figure also illustrates a low-voltage NMOS high-
level shifter composed of only low-Vt high-voltage transistors, which
virtually eliminates leakage current from the boosted voltage and which
operates even at a supply voltage of 1.4V [27].

3.2.2.3 Sense Amplifier

Low-voltage sense amp for NOR flash memory


The sense amp illustrated in Figure 3.19 has been utilized for several
generations with a supply voltage equal to or higher than 3V [42]. For sub-
2V NOR flash memory, three types of low-voltage sense amps have been
proposed [37][38][39]. The common idea is that the diode-connected load
transistors are removed from the critical path to lower the minimum
operating supply voltage. The design method for the clamp load transistors
in the low-voltage sense amp illustrated in Figure 3.19 is very important for
low-voltage high-speed operation [38].

Low-voltage sense amp for NAND flash


NAND flash memories have a page-oriented read access capability suitable
for a small cell current of about 1uA. All of the bit-lines are precharged to
Vdd, and the word lines in a selected block then rise to flow cell current
through the NAND string. After a time period of 5-10us, the bit-line voltage
is sensed with SEN high in the conventional sense amp [40]. When the bit-
line voltage is higher than the threshold voltage of the sensing NMOS
transistor, the latch is compulsorily inverted to hold the cell data of 0. On the
other hand, the bit-lines are precharged to a first-clamped level by a
PRG/CLAMP transistor in order to reduce the power in the low-voltage
sense amp. After the NAND strings pass the cell current, the gate voltage of
the clamp transistor is set to a second clamped level lower than the first
level. The capacitance of the sense node is sufficiently small in comparison with that of the bit-line. Thus, the voltage at the sense node is as low as that
of the bit-line when the bit-line is discharged by the cell current in the case
of 1-data. Because the bit-line keeps the precharged level in the case of 0-
data, the clamp transistor is in an off-state, leaving a high voltage at the sense node. In this sense amp, the voltage swing of the bit-line can be reduced so that fast random access can be achieved [41].

3.2.2.4 Effect of the Supply Voltage Reduction on Power

As shown in the charge pump section, unless the internal high voltages for
read, program, and erase operations are scaled with the supply voltage, the
power consumed in the on-chip voltage generation system would increase (instead of decrease) due to
degradation in power efficiency. On the other hand, the power dissipated in
the low-voltage logic gates and the internal and external buses can be
reduced roughly in proportion to the square of the supply voltage. The dominant component of the power in the program and erase operations is the former, whereas that in the read
operation is the latter. Therefore, the effectiveness of the supply voltage
reduction on the power reduction depends on the duty factor of read cycles
to rewrite cycles. As shown in Figure 3.21, the power for both NOR and NAND flash memories can be reduced in standard use, where the read duty is greater than 50% [27].
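This trade-off can be sketched with a toy power model; the coefficients and voltages are assumed values, chosen only to reproduce the qualitative behaviour, not fitted to the data of [27].

```python
# Toy model of total flash power versus read duty, combining a bus/logic
# component that scales ~ Vdd^2 (read) and a charge-pump-dominated
# component that worsens as Vdd drops (program/erase). Coefficients and
# the cell operating voltage are assumed illustrative values.

def total_power(vdd, read_duty, k_logic=1.0, k_pump=1.0, v_cell=10.0, vt=0.6):
    p_read = k_logic * vdd ** 2
    # pump stage count ~ (v_cell - vdd)/(vdd - vt); input power ~ vdd * stages
    p_rewrite = k_pump * vdd * (v_cell - vdd) / (vdd - vt)
    return read_duty * p_read + (1 - read_duty) * p_rewrite

# At a read-dominated duty, lowering Vdd reduces total power in this model:
assert total_power(1.8, 0.9) < total_power(3.0, 0.9)
# At a rewrite-dominated duty, the pump penalty wins and power increases:
assert total_power(1.8, 0.0) > total_power(3.0, 0.0)
```

The two assertions capture the point of the paragraph: whether supply scaling helps depends on the ratio of read cycles to rewrite cycles.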

3.3 FERROELECTRIC MEMORY

3.3.1 Basic Operation of FeRAM

Ferroelectric material and its application to semiconductor memory devices have a long history, dating back to an experiment by Tarui and Moll [43]. The access-transistor-plus-capacitor type ferroelectric RAM, or FeRAM, has been intensively investigated since the introduction of PZT film [44][45][46]. A
ferroelectric memory cell consists of one access transistor and one capacitor.
The cell topology is similar to that of a DRAM cell. In a DRAM cell, the information is the charge stored in a paraelectric (linear) capacitor; in an FeRAM cell, it is stored as the polarization of a ferroelectric film.

Major ferroelectric materials have a perovskite structure. Figure 3.22 shows an example: the crystal structure of PZT and its deformation under an applied electric field. A Ti or Zr atom is surrounded by Pb and O atoms, and it has two stable positions, an upper site and a lower site. The position of
the central atom can be flipped by applying an external electric field, and the
position is unchanged even when the applied electric field is turned off.
Data is thus stored as a non-volatile remanent polarization. As shown in Figure 3.23, the polarization has a hysteresis, and there are two stable states, “0” and “1,” at an applied voltage of 0. Because those two states are stable, even
when the access transistor is turned on, no free charge is read. In order to
read the stored polarization in the film, a voltage must be applied to the capacitor, in contrast to a DRAM read operation.
Figure 3.24 shows the read operation of FeRAM. In the read operation,
(1) the bit-line is floating and precharged to 0V. Then (2) the word line is
turned on, and (3) the plate line is driven high to apply a voltage to the ferroelectric capacitor. Just after the plate line drive, nearly the full plate-line voltage appears (negatively) across the capacitor. Then charges are shared by the ferroelectric capacitor and
the bit-line. Therefore the read out signal can be obtained as the intersection
between the ferroelectric capacitor’s hysteresis curve and the linear
capacitance with the negative slope of the bit-line capacitance. If the data is
“1,” the read-out voltage is higher than that for “0.” In 2T-2C mode, which
stores 1 bit using a pair of 1T-1C cells connected to complimentary bit-lines
as shown in Figure 3.25, the signal is the difference between the “0” and
“1.” In 1T-1C mode 1 bit is stored in the 1T-1C cell, so the reference voltage
should be supplied in-between the “0” and “1” and accordingly the signal is
half or less than the 2T-2C mode. As shown in Figure 3.24, the stored “1”
data is destroyed in the read cycle; therefore a restore operation is necessary.

Figure 3.26 shows the restore and write operation. When the
read operation is finished, the plate line is still activated; therefore the “0” data
can be restored to the cell. For the “1” cell, however, the voltage applied to
the cell capacitor is 0V, because the plate line level is “H” and the bit-line (cell
capacitor node) is also “H.” Therefore, in order to restore the “1” data, the plate
line must be pulled down. Thus, in the FeRAM restore cycle, the “0” data restore
and the “1” data restore are done with separate timing. The write operation is
similar to the restore. In the write operation, the data on the bit-lines are forced
by the write buffers via the data bus.

3.3.2 Low Voltage FeRAM Design

3.3.2.1 Optimization of Bit-line Capacitance

For stable operation at a low supply voltage, it is essential to obtain a larger
read-out signal. As described in the previous section, the read-out signal depends
on the bit-line capacitance and the hysteresis curve of the cell capacitor. In a
DRAM, where the cell capacitor has linear characteristics, the read-out signal
decreases monotonically with the bit-line capacitance. Therefore, a smaller
bit-line capacitance is always better from the viewpoint of signal magnitude.
With the non-linear hysteresis characteristics of the ferroelectric capacitor,
however, there is an optimal ratio of bit-line capacitance to cell capacitance.
Figure 3.27 shows an example of the dependence of the read-out signal on the
bit-line capacitance. The read-out signal has a peak value around a bit-line
capacitance of 150fF in this particular case. As shown, it is important to
optimize the bit-line capacitance in a low voltage FeRAM design.
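The read-out signal as the intersection of the hysteresis branch and the bit-line load line can be sketched numerically. In the Python sketch below, the tanh hysteresis-branch model and every parameter value (QS, VC, K, C0, VDD) are illustrative assumptions, not data from the chapter; the point is only that the “1”/“0” signal difference peaks at an intermediate bit-line capacitance.

```python
import math

VDD = 2.5
QS, VC, K = 20e-15, 0.8, 0.25   # assumed saturation charge [C], coercive voltage [V], slope [V]
C0 = 15e-15                     # assumed linear (non-switching) cell capacitance [F]

def q_cell(vcap, data):
    """Charge delivered to the bit line for an applied capacitor voltage vcap."""
    if data == 1:                # polarization switches: non-linear hysteresis branch
        return QS * (1.0 + math.tanh((vcap - VC) / K))
    return C0 * vcap             # "0": roughly linear, little charge released

def bitline_voltage(c_bl, data):
    """Solve q_cell(VDD - vbl, data) = c_bl * vbl by bisection (charge conservation)."""
    lo, hi = 0.0, VDD
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if q_cell(VDD - mid, data) - c_bl * mid > 0:
            lo = mid             # cell still supplies more charge: vbl is higher
        else:
            hi = mid
    return 0.5 * (lo + hi)

def read_signal(c_bl):
    return bitline_voltage(c_bl, 1) - bitline_voltage(c_bl, 0)

# Sweep the bit-line capacitance and locate the peak read-out signal.
caps = [c * 1e-15 for c in range(10, 500, 5)]
best = max(caps, key=read_signal)
print(f"peak signal {read_signal(best)*1e3:.0f} mV at C_bl = {best*1e15:.0f} fF")
```

With these assumed parameters the peak lands at a few tens of fF; the chapter's 150fF example simply corresponds to a different cell.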

3.3.2.2 Cell Plate Line Drive Techniques

Theoretically, the entire cell plate can be driven at the same time. This
common plate scheme is very area efficient, but there are two problems.
First, the cell plates of unselected cells are also driven. Then the voltage of
the storage node of the unselected cells is boosted, but, due to the parasitic
capacitances of the storage node, the voltage cannot be as high as the cell
plate voltage. So the unselected cells experience a disturbance voltage across
the ferroelectric capacitor. This may cause the degradation of stored
polarization. Second, the cell plate capacitance becomes high, making the
cell plate drive very slow. The capacitance is huge because many high-
dielectric-constant capacitors are connected in parallel. To reduce the cell
plate line capacitance, a divided cell plate line was proposed [47]. If cell
plates are further divided so that each cell has its own word line and plate line,
the disturbance problem can be solved. However, this scheme reduces cell
area efficiency, and the high resistance of the divided narrow cell plate
lengthens the cell plate drive time. This scheme also requires many cell plate
line drivers, which further reduces the area efficiency. The bit-line-parallel
Chain FeRAM was proposed to solve those problems [48][49]. In the Chain
FeRAM, a unit cell
is composed of one access transistor and one ferroelectric capacitor in
parallel, which looks like a ring. Unit cells are also connected in series,
making a chain block with the block select transistor. Figure 3.29 shows the
basic operation of Chain FeRAM. In the standby cycle, all of the word lines
are pulled up and turned on, and all of the cell capacitors are biased to 0V.
Therefore, all cell data are safely stored. In the active cycle, the block select
transistor is turned on, and the word line of the selected cell is pulled down
and cut off. Then the cell plate line is driven to Vdd. Only the target cell
capacitor is biased to Vdd, and the cell data is read out to the bit line.
In the Chain FeRAM, the cell plate line is shared by plural cells, and the cell
area efficiency becomes much higher. Because the cell plate line is shared,
the line can be made wide and of low resistance without sacrificing area, and
the plate line driver can be easily enlarged to have enough driveability.

3.3.2.3 Non-driven Cell Plate Line Scheme

As shown in the previous section, the speed of a FeRAM with a driven cell
plate is largely determined by the cell plate drive time. The non-driven cell
plate line scheme, which is similar to the DRAM operation, was proposed in
[50]. The cell plate is set to a constant 1/2 Vdd, as shown in Figure 3.31. The
bit line is precharged to 0V. In the read operation, when the word line of the
selected cell is turned on, a 1/2 Vdd bias is applied to the target cell, and the
signal can be read out to the bit line. A drawback of the non-driven cell
plate line scheme is that it requires refresh cycles. As shown in Figure 3.32,
the word lines of the unselected cells are cut off, and the storage nodes of the
unselected cells are then discharged to the p-well bias of 0V by the leakage
current of the diffusion layer. A plate line voltage of 1/2 Vdd is then applied to
the cell capacitors, which degrades the cell data. So, just like DRAM cells,
these cells need to be refreshed to 1/2 Vdd at some interval. This reduces
FeRAM’s inherent advantage of non-volatility.

Short-circuiting the capacitor electrodes of unselected cells is an effective
way to avoid the refresh cycle. Figure 3.33 shows the DeFeRAM proposed
by Infineon [51]. Recall that depletion type transistors are on when their gate
voltages are at 0V. Normally, all of the cell capacitors are short-circuited. In
the access mode, the word line of the access transistor of the target cell is
turned on and, simultaneously, the depletion transistor of the target cell is
switched off by applying a negative bias to its gate.
In the Chain FeRAM shown in Figure 3.29, all of the unselected capacitors
are likewise short-circuited. Therefore, the Chain FeRAM can adopt the
non-driven cell plate scheme without introducing any additional devices.

3.3.2.4 Other Low Voltage Techniques

The reliability of FeRAM depends on the level of polarization, so it is
important to apply a high enough voltage to saturate the polarization. On the
other hand, the power supply voltage should be lowered to reduce the power
consumption of the logic section of the chip. In such embedded FeRAM
devices, the memory array voltage needs to be boosted. In order to apply a
high enough voltage without sacrificing much power consumption, a
hierarchical bit line boost scheme was proposed [52]. In this scheme, bit lines
are divided into sub bit lines, which are connected to the main bit line via
separation transistors. Only the accessed sub bit lines are boosted, to avoid
wasting charge.
Another scheme that makes the most of the array voltage is shown in Figure
3.34. As shown in Figure 3.27 and the inset of Figure 3.34, the bit line voltage
is boosted by the cell capacitor when the plate line is driven. Thus the actual
voltage applied across the capacitor is less than the cell plate line voltage. In
the ferrocapacitor overdrive method [49], when the plate line is driven, an
additional capacitor pulls down the bit line in order to apply a larger bias
across the cell capacitor. In other words, this method squeezes more
polarization out as free charge, so that the read-out signal becomes larger.
The method is effective when the bit line capacitance is small and the array
voltage is low.

3.4 EMBEDDED DRAM

3.4.1 Advantages of Embedded DRAM

Combining DRAM and logic gates on one chip has been pursued since
the 1980s; the first realistic demonstration was Toshiba’s 72K gate
array with 1 Mbit DRAM [53]. Since then, intensive work has been
performed [54][55][56]. The advantages of embedded DRAMs are the
ability to (1) achieve high data bandwidth between logic gates and memory,
(2) reduce I/O power consumption, (3) reduce EMI, and (4) reduce the
number of parts in a system. Figure 3.13 illustrates the data transfer rate and
I/O power consumption. High performance systems require high data
throughput memory. When realizing high data throughput by employing a
wide-bit I/O chip configuration and high clock frequency in a stand-alone
DRAM, I/O drive power can easily become as large as 1 W. Eliminating the
large buffers, bonding pads, and external buses reduces the I/O power
drastically. The clock rate can also be enhanced more easily for higher data
throughput than in stand-alone DRAMs.
Therefore, the current major applications of eDRAM are high-speed
graphics and low power mobile MPEG chips [57][58].
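The ~1W figure for a stand-alone wide-I/O DRAM can be checked with a back-of-the-envelope C·V²·f estimate. All loads, swings, bus widths, and activity factors below are assumptions for illustration, not values from the chapter:

```python
def io_power(bits, c_load, vswing, f_clk, activity=0.5):
    """Average I/O drive power: P = a * N * C * V^2 * f."""
    return activity * bits * c_load * vswing**2 * f_clk

# Off-chip: assumed 64-bit bus, ~30 pF pad/trace load per pin, 3.3 V swing, 100 MHz.
off_chip = io_power(64, 30e-12, 3.3, 100e6)

# On-chip embedded DRAM: assumed 256-bit bus, ~1 pF wire load, 2.5 V swing, 100 MHz.
on_chip = io_power(256, 1e-12, 2.5, 100e6)

print(f"off-chip bus ~ {off_chip:.2f} W, on-chip bus ~ {on_chip:.3f} W")
```

The off-chip estimate lands near the 1 W quoted in the text, while the wider on-chip bus moves far more bits for a small fraction of the power.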

3.4.2 Low Voltage Embedded DRAM Design

Reducing the operating voltage of a logic LSI with embedded DRAM is
important, especially when one needs to reduce power consumption and
maintain the device reliability of the logic gate parts. Low voltage operation,
however, is a big challenge for the DRAM macro parts. In stand-alone
DRAMs, the operation voltage is around 1.8V or higher, and a half-Vdd bit
line precharge scheme is dominant [1]. Half-Vdd bit line precharge schemes
have a low power advantage over Vdd or Vss precharge schemes because the
precharge level can be generated by charge-sharing between complementary
bit lines. However, with the half-Vdd precharge scheme, only a half-Vdd is
applied to the dynamic flip-flop type sense amplifier. Thus the current
driveability of the sense amplifier is too low to maintain high-speed
operation.

A Vss bit line precharge scheme is utilized in embedded DRAM to support
a lower operation voltage by providing more overdrive during the sense
operation. Furthermore, Vss precharge is more favorable for high-speed
operation than Vdd precharge because, in the signal development stage, the
access transistor can be turned on earlier in the case where “H” data is stored
in the cell [59].
The Vdd/Vss hybrid precharge scheme [60] was proposed from the power-
aware design point of view. Figure 3.37 shows the proposed array and
operation waveforms. The sense amplifiers are divided into two groups; one is
precharged to Vdd and the other is precharged to Vss. In the precharge cycle,
before fully precharging the bit lines, the two groups of sense amplifiers are
connected so as not to waste charge. Then the two groups of sense amplifiers
are disconnected and are fully precharged to Vdd and Vss, respectively.

Figure 3.38 compares the sensing time and the bit line charging and
discharging power consumption of the half-Vdd, Vss, and Vdd/Vss hybrid
precharge schemes. As seen in the figure, at a low operation voltage the
half-Vdd precharge scheme is marginal in sensing time, and the Vdd/Vss
hybrid precharge scheme is favourable from the viewpoint of power
consumption when compared with the Vss precharge scheme.
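Why connecting the Vdd-precharged and Vss-precharged groups before the full precharge saves charge can be seen from standard capacitor energy bookkeeping. A minimal sketch, with an assumed bit-line capacitance:

```python
VDD = 2.5
C_BL = 200e-15   # assumed bit-line capacitance [F]

def supply_energy(c, v_supply, v_start, v_end):
    """Energy drawn from a rail at v_supply while capacitance c moves from v_start to v_end."""
    return c * v_supply * (v_end - v_start)

# Conventional full precharge: pull a discharged bit line all the way to VDD.
direct = supply_energy(C_BL, VDD, 0.0, VDD)

# Hybrid idea: first connect it to an equal bit line sitting at VDD
# (charge sharing, both settle at VDD/2, no supply energy), then top up.
hybrid = supply_energy(C_BL, VDD, VDD / 2, VDD)

print(f"supply energy: direct = {direct*1e15:.0f} fJ, with charge sharing = {hybrid*1e15:.0f} fJ")
```

The intermediate charge-sharing step halves the energy drawn from the supply, which is the intent of connecting the two sense-amplifier groups in the hybrid scheme.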

3.5 SUMMARY

New low-power applications in the Internet/wireless era require the
development of non-volatile memories and embedded high density
memories that have low power and high performance features
simultaneously. Lowering the operating voltage is the most effective way of
reducing the power, while it is a big challenge for stable and high speed
memory operation. This chapter described a number of low voltage design
techniques. The operation voltage has almost reached 1.5V. For further
operation voltage reduction there are many challenges, including device
physics limitations such as the threshold voltage limitation [61] and the gate
dielectric tunnelling current limitation. There is quite a lot of work to be
done in this area.

REFERENCES
[1] J. M. Rabaey and M. Pedram, “Low power design methodologies,” Kluwer Academic
Publishers, pp201-251, 1996.
[2] K. Itoh et al., “Trends in Low-Power RAM Circuit Technologies,” Proc. IEEE, Vol. 83,
pp. 524-543, 1995.
[3] D. D. Buss,“Technology in the Internet Age,” ISSCC Digest of Technical Papers, pp.18-
19, 2002.
[4] F. Masuoka et al., “A new flash EEPROM cell using triple polysilicon technology,”
Technical Digest of IEDM, pp. 464-7,1984.
[5] V. N. Kynett et al., “An in-system reprogrammable 256K CMOS Flash memory,” ISSCC
Digest of Technical papers, pp. 132-3, 1988.
[6] F. Masuoka et al.,“New ultra high density EPROM and flash EEPROM cell with NAND
structure cell,” Technical Digest of IEDM, pp. 552-5, 1987.
[7] H. Onoda et al., “A novel cell structure suitable for a 3V operation, sector erase Flash
memory,” Technical Digest of IEDM, pp. 599-602, 1992.
[8] H. Kume et al., “A contactless memory cell technology for a 3V only 64Mb
EEPROM,” Technical Digest of IEDM, pp. 991-3,1992.
[9] S. Mehrotra et al., “Serial 9Mb Flash EEPROM for solid state disk applications,”
Symposium on VLSI Circuits, pp. 24-5, 1992.
[10] S. Kianian et al., “A novel 3 volts-only, small sector erase, high density Flash
Symposium on VLSI Technology, pp. 71-2,1994.
[11] S. D'Arrigo et al., “A 5 V-only 256K bit CMOS Flash EEPROM,” ISSCC Digest of
Technical papers, pp. 132-3, 1989.
[12] J. C. Chen et al., “A 2.7V only 8Mb x16 NOR Flash Memory,” Symposium on VLSI
Circuits, pp. 172-3, 1996.
[13] S. Atsumi et al., “A Channel-Erasing 1.8V Only 32Mb NOR Flash EEPROM with a Bit-
Line Direct Sensing Scheme,” ISSCC Digest of Technical Papers, pp. 814-5, 2000.
[14] A. Brand et al., “Novel read disturb failure mechanism induced by FLASH cycling,”
Reliability Physics Symposium. 1993. 31st International Annual Proceedings, pp. 127-32,
1993.
[15] K. Naruke et al., “Stress induced leakage current limiting to scale down EEPROM tunnel
oxide thickness,” Technical Digest of IEDM, pp. 424-7, 1988.
[16] J. D. Bude et al., “Secondary Electron flash-a high performance, low power flash
technology for 0.35um and below,” IEDM Technical Digest, pp. 279-82, 1997.
[17] D. Esseni et al., “Trading-off programming speed and current absorption in flash
memories with the ramped-gate programming technique,” IEEE Transactions on Electron
Devices, vol. 47, no. 4, pp. 828 –834, Apr. 2000.
[18] A. T. Wu et al., “A source-side injection erasable programmable read-only-memory (SI-
EPROM) device,” IEEE Electron Device Letter, vol.EDL-7, no.9 pp.540-2, Sep. 1986.
[19] Y. Okuda et al., “A 0.9 V operation 2-transistor flash memory for embedded logic LSIs,”
Digest of Technical Papers, Symposium on VLSI Technology, pp. 21-2, Jun. 1999.
[20] T. Ikehashi et al., “A 60 ns access 32 kByte 3-transistor flash for low power embedded
applications,” Digest of Technical Papers of symposium on VLSI circuits, pp. 162-5, Jun.
2000.
[21] T. Tanaka et al., “A quick intelligent program architecture for 3 V-only NAND-
EEPROMs,” Symposium on VLSI Circuits, Digest of Technical Papers, pp.20-1, 1992.
[22] K. Imamiya et al., “A 35ns-Cycle-Time 3.3V-Only 32Mb NAND Flash EEPROM,”
ISSCC Digest of Technical Papers, pp. 130-1, 1995.
[23] K. D. Suh et al., “A 3.3V 32Mb NAND Flash Memory with Incremental Step Pulse
Programming Scheme,” ISSCC Digest of Technical Papers, pp. 128-9, 1995.
[24] S. Satoh et al., “A novel isolation-scaling technology for NAND EEPROMs with the
minimized program disturbance,” Technical Digest of International Electron Devices
Meeting, pp. 291-4, 1997.
[25] K. Takeuchi et al., “A source-line programming scheme for low-voltage operation NAND
flash memories,” IEEE Journal of Solid-State Circuits, Vol. 35, No. 5, pp. 672-81, May
2000.
[26] T. Tanaka et al., “A quick intelligent pp.-programming architecture and a shielded bitline
sensing method for 3 V-only NAND flash memory,” IEEE J. Solid-State Circuits, vol.29,
no.11, pp. 1366-73, Nov. 1994.
[27] T. Tanzawa, “Low-voltage circuit design for high-performance flash memories,” IEEE J.
Solid-State Circuits, to be published
[28] J.F.Dickson, “On-chip high-voltage generation in mnos integrated circuits using an
improved voltage multiplier technique,” IEEE J. Solid-State Circuits, Vol.SC-11, No.3,
pp374-378, Jun. 1976.
[29] T. Tanzawa and T. Tanaka, “A dynamic analysis of the dickson charge pump circuit,”
IEEE J. Solid-State Circuits, Vol.32, No.8, pp.1231-40, Aug., 1997.
[30] T. Tanzawa and S. Atsumi, “Optimization of word-line booster circuits for low-voltage
flash memories,” IEEE J. Solid-State Circuits, Vol.34, No.8, pp.1091-8, Aug., 1999.
[31] A. Umezawa et al., “A 5V-only operation 0.6um Flash EEPROM with row decoder
scheme in triple-well structure,” IEEE J. Solid-State Circuits, Vol. 27, No. 11, pp. 1540-
1546, Nov. 1992.
[32] K.Sawada et al., “An on-chip high-voltage generator circuit for EEPROMs with a power
supply voltage below 2V,” 1995 Symposium on VLSI Circuits Digest of Technical Papers,
pp.75-76, Jun.1995.
[33] Y. Miyawaki et al., “A new erasing and row decoding scheme for low supply voltage
operation 16-Mb/64-Mb Flash Memories,” IEEE J. Solid-State Circuits, Vol.27, No.4,
pp.583-8, Apr., 1992.
[34] T. Tanzawa et al., “Circuit techniques for a 1.8V-Only NAND flash memory,” IEEE
Journal of Solid-State Circuits, Vol. 37, No. 1, pp. 84-9, Jan., 2002.
[35] T. Tanzawa et al., “Word-Line Voltage Generating System for Low-Power Low-Voltage
Flash Memories,” IEEE Journal of Solid-State Circuits, Vol. 36, No. 1, pp. 55-63, Jan.
2001.
[36] T. Tanzawa et al., “high voltage transistor scaling circuit techniques for high-density
negative-gate channel-erasing NOR flash memories,” IEEE Non-Volatile Semiconductor
Memory Workshop, Digest of Technical Papers, Aug., 2001.
[37] N. Otsuka and M. Horowitz, “Circuit techniques for 1.5-V power supply flash memory,”
IEEE J. Solid-State Circuits, Vol.32, No.8, pp.1217-30, Aug., 1997.
[38] T. Tanzawa et al., “Design of a sense circuit for low-voltage flash memories,” IEEE
Journal of Solid-State Circuits, Vol. 35, No. 10, Oct 2000.
[39] B. Parthak et al., “A 1.8V 64Mb 100MHz flexible read while write flash memory,” ISSCC
Digest of Technical Papers, pp.32-3, Feb. 2001.
[40] H. Nakamura et al., “A novel sense amplifier for flexible voltage operation NAND flash
memories,” Digest of Technical Papers of Symposium on VLSI Circuits, pp. 71-2, Jun.
1995.
[41] K. Imamiya et al., “A 256Mb NAND flash with shallow trench isolation
technology,” ISSCC Digest of Technical Papers, pp. 112-3, Feb. 1999.
[42] S. Atsumi et al., “Fast programmable 256K read only memory with on-chip test circuits,”
IEEE J. Solid-State Circuits, Vol.SC-20, No.1, pp.422-7, Feb., 1985.
[43] J. L. Moll and Y. Tarui, “A new solid state memory resistor,” IEEE Trans. Electron
Devices, Vol. ED-10, pp. 338-9, 1963.
[44] J. T. Evans et al., IEEE J. Solid-State Circuits, Vol. SC-23, No. 5, pp. 1171-1175.
[45] R. Womack et al., “A 16Kb ferroelectric nonvolatile memory with a bit parallel
architecture,” ISSCC Digest of Technical Papers, pp. 242-3, Feb. 1989.
[46] S. S. Eaton et al., “A ferroelectric nonvolatile memory,” ISSCC Digest of Technical
Papers, pp. 130-1, Feb. 1988.
[47] T. Sumi et al., “A 256Kb nonvolatile ferroelectric memory at 3V and 100ns,” ISSCC
Digest of Technical Papers, pp. 268-9, Feb. 1994.
[48] D. Takashima et al., “A sub-40ns random access chain FRAM architecture with 7ns
cell-plate-line drive,” ISSCC Digest of Technical Papers, pp. 102-3, Feb. 1999.
[49] D. Takashima et al., “A 76mm2 8Mb chain ferroelectric memory,” ISSCC Digest of
Technical Papers, pp. 40-1, Feb. 2001.
[50] H. Koike et al., “A 60-ns 1-Mb nonvolatile ferroelectric memory with a non-driven cell
plate line write/read scheme,” IEEE Journal of Solid-State Circuits, Vol. 31, pp. 1625-
1634, Nov. 1996.
[51] G. Braun et al., “A robust 8F2 ferroelectric RAM cell with depletion device (DeFeRAM),”
IEEE Journal of Solid-State Circuits, Vol. 35, pp. 691-700, May 2000.
[52] H-B. Kang et al., “A hierarchy bitline boost scheme for sub-1.5V operation and short
precharge time on high density FeRAM,” ISSCC Digest of Technical Papers, pp. 158-9,
Feb. 2002.
[53] K. Sawada et al., “A 72-K CMOS channelless gate array with embedded 1-Mbit dynamic
RAM,” CICC Digest, pp. 20.3.1-4, May 1988.
[54] S. Miyano et al., “A 1.6GB/s data-transfer-rate 8-Mb embedded DRAM,” ISSCC Digest
of Technical Papers, pp. 300-1, Feb. 1995.
[55] K. Itoh et al., “Limitations and challenges of multigigabit DRAM chip design,” IEEE
Journal of Solid-State Circuits, Vol. 32, pp. 624-634, May 1997.
[56] T. Yabe et al., “A configurable DRAM macro design for 2112 derivative organizations to
be synthesized using a memory generator,” ISSCC Digest of Technical Papers, pp. 72-3,
Feb. 1998.
[57] T. Nishikawa et al., “A 60MHz 240mW MPEG-4 video-phone LSI with 16Mb embedded
DRAM,” ISSCC Digest of Technical Papers, pp. 130-1, Feb. 2000.
[58] A. Kahn et al., “A 150MHz graphic rendering processor with 256Mb embedded DRAM,”
ISSCC Digest of Technical Papers, pp. 150-1, Feb. 2001.
[59] J. Barth et al., “A 300MHz multi-banked eDRAM macro featuring GND sense, bit-line
twisting and direct reference cell write,” ISSCC Digest of Technical Papers, pp. 156-7,
Feb. 2002.
[60] H. Nakano et al., “A dual layer bitline DRAM array with Vcc/Vss hybrid precharge for
multi-gigabit DRAMs,” Symposium on VLSI Circuits, pp. 190-1, 1996.
[61] Y. Oowaki et al., “A sub-0.1um circuit design with substrate-over-biasing,” ISSCC Digest
of Technical Papers, pp. 88-9, Feb., 1998
Chapter 4
Low-Power Digital Circuit Design

Tadahiro Kuroda
Keio University

Abstract: Circuit techniques for power-aware design are presented, including techniques
for a variable supply voltage, a variable threshold voltage, multiple supply
voltages, multiple threshold voltages, a low-voltage SRAM, a conditional flip-
flop, and an embedded DRAM.

Key words: Supply voltage, threshold voltage, variable, multiple, substrate bias, low-
voltage SRAM, conditional flip-flop, embedded DRAM.

4.1 INTRODUCTION

CMOS power dissipation has been increasing due to the increase in power
density caused by device scaling [1]. Constant voltage scaling was employed
until the early 1990s, when the power density increased rapidly with the
device scaling factor, resulting in a fourfold increase in the power dissipation
every three years. Recently, constant field scaling has been applied to deal
with the power problem. The power density still increases with scaling,
leading to a doubling of the power dissipation every 6.5 years. It can be
assumed that the power dissipation in CMOS chips will increase steadily as a
natural result of device scaling.
Future computer and communications technology, on the other hand, will
require further reduction in power dissipation [2]. Ubiquitous computing is
the next generation in information technology where computers and
communications will be scaled further, merged together, and materialized in
consumer applications. Computers will be invisible behind broadband
networks as servers, while terminals will come closer to people, even as
wearable or implantable devices. IC chips will be implanted everywhere so
that things can think and talk for sophisticated human-computer interactions.
One of the key technologies needed to reach this end is low-power technology.
Since no new energy-efficient device technology is on the horizon, low-power
CMOS design is the current challenge [3].
General guidelines for power reduction in CMOS are derived from its
power dissipation and speed equations. CMOS power dissipation, P, and
propagation delay time, tpd, are approximately given by

    P = pt·fclk·CL·Vdd² + I0·10^(–Vth/S)·Vdd        (4.1)

    tpd = K·CL·Vdd / (Vdd – Vth)^α                   (4.2)

where pt is the switching probability, fclk is the clock frequency, CL is the
load capacitance, Vdd is the power supply voltage, Vth is the threshold
voltage, S is the subthreshold slope, α represents the velocity saturation
effect (typically 1.4), and the other parameters (I0, K) are constants determined
by circuit, layout, and device parameters. The general guidelines for power
reduction derived from these equations are threefold: 1) lower Vdd, 2) lower
CL, and 3) reduce the switching activity (pt·fclk).
This chapter presents popular circuit techniques for lowering these three
parameters as well as quantitative models to optimize them.
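These trade-offs can be sketched numerically. The model below uses the standard dynamic-plus-subthreshold power expression and the alpha-power delay law; all constants (I0, S, K, and the operating point) are illustrative assumptions, not the chapter's values:

```python
# Illustrative constants (assumptions, not from the chapter).
I0, S, ALPHA, K = 1e-3, 0.1, 1.4, 1.0

def power(vdd, vth, pt=0.15, fclk=100e6, cl=100e-12):
    """P = pt*fclk*CL*Vdd^2 + I0*10^(-Vth/S)*Vdd (dynamic + subthreshold leakage)."""
    return pt * fclk * cl * vdd**2 + I0 * 10 ** (-vth / S) * vdd

def delay(vdd, vth, cl=100e-12):
    """tpd = K*CL*Vdd/(Vdd - Vth)^alpha (alpha-power delay law)."""
    return K * cl * vdd / (vdd - vth) ** ALPHA

# Halving Vdd roughly quarters the dynamic power but slows the gate down,
# while lowering Vth at fixed Vdd raises the leakage term exponentially.
p_drop = power(1.25, 0.5) / power(2.5, 0.5)
d_rise = delay(1.25, 0.5) / delay(2.5, 0.5)
print(f"power x{p_drop:.2f}, delay x{d_rise:.2f}")
```

The quadratic Vdd dependence is why lowering Vdd is the first guideline, and the delay penalty is why the Vth and architecture techniques of the following sections are needed.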

4.2 LOW VOLTAGE TECHNOLOGIES

Lowering Vdd is the most attractive choice due to the quadratic dependence
of power on Vdd. However, as Vdd becomes lower, circuit delay increases and
chip throughput degrades. There are three different approaches used to
maintain chip throughput at a low Vdd: 1) utilize parallel and/or pipeline
architectures to compensate for the degraded circuit speed [4], 2) lower Vth
to recover the circuit speed, and 3) employ multiple Vdd and Vth for
non-critical circuits.
The idea behind the first approach is that a circuit can be slow with a good
architecture. Silicon area is traded for power reduction. The idea in the
second approach is that the circuit should be fast. This approach, combined
with a low Vdd, increases subthreshold leakage current and, consequently,
standby power dissipation. In standby mode, Vth should be raised.
Furthermore, the requirement for circuit speed in active mode often changes
from time to time. Consequently, variable Vdd and Vth are essential. In the
third approach, some circuits should be fast and others can be slow. In other
words, this approach utilizes a timing surplus. Since the speed requirement
differs spatially from circuit to circuit, multiple Vdd and Vth are effective.

Circuit design techniques for the second and third approaches, as well as
theoretical models for quantitative understanding will be discussed in detail.

4.2.1 Variable Vdd and Vth

Figure 4.1 depicts equi-power (solid lines) and equi-speed (broken lines)
curves on the Vdd–Vth plane, calculated by using equations (4.1) and (4.2) [1].
A rectangle in the figure illustrates the ranges of Vdd change and Vth
fluctuation that should be taken into account. This rectangle is a design
window, because all the circuit specifications should be satisfied within the
rectangle for yield-conscious design. In the design window, the circuit speed
becomes the slowest at the upper-left corner S, while at the lower-right corner
P the power dissipation becomes the highest. The equi-speed and equi-power
curves are normalized at the corners S and P, as designated by the
corresponding normalization factors, so that the amount of speed and power
that must be improved or degraded, compared to those in the typical
condition, can be calculated by sliding and sizing the design window on the
Vdd–Vth plane.

When the design window is moved toward lower Vdd and lower Vth along
the equi-speed curve, the power dissipation is reduced. Since the subthreshold
leakage current increases rapidly as Vth is lowered, the power dissipation will
increase again beyond the point where the leakage current dominates the
power dissipation. In Figure 4.1 it can be seen that the power dissipation is
at a minimum around the point where the power dissipation due to the
subthreshold leakage current makes up 10% of the total power dissipation.
This condition is also depicted as a broken line in the figure to indicate the
optimum Vth. A quantitative analysis is found in [5], which leads to
approximately the same conclusion.
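A minimal sketch of this optimization, under an assumed alpha-power delay model and illustrative leakage constants (none of the values are the chapter's): for each Vth, the Vdd that keeps the gate speed constant is found by bisection, and the total power is evaluated along that equi-speed contour.

```python
# Illustrative constants (assumptions, not from the chapter).
I0, S, ALPHA = 0.01, 0.1, 1.4
PT_F_CL = 0.15 * 100e6 * 100e-12            # pt * fclk * CL
D_TARGET = 2.5 / (2.5 - 0.5) ** ALPHA       # delay metric at the reference point

def vdd_for_speed(vth):
    """Bisect for the Vdd that keeps Vdd/(Vdd - Vth)^alpha at the target (equi-speed)."""
    lo, hi = vth + 0.05, 4.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if mid / (mid - vth) ** ALPHA > D_TARGET:
            lo = mid      # still too slow: raise Vdd
        else:
            hi = mid
    return 0.5 * (lo + hi)

def total_power(vth):
    vdd = vdd_for_speed(vth)
    dyn = PT_F_CL * vdd**2
    leak = I0 * 10 ** (-vth / S) * vdd
    return dyn + leak, leak / (dyn + leak)

vths = [0.01 * i for i in range(5, 46)]      # sweep Vth from 0.05 V to 0.45 V
best_vth = min(vths, key=lambda v: total_power(v)[0])
p, frac = total_power(best_vth)
print(f"optimum Vth ~ {best_vth:.2f} V, leakage fraction ~ {frac:.0%}")
```

With these assumed constants the minimum falls where leakage is on the order of 10% of the total, consistent with the broken-line condition described above; the exact fraction shifts with the constants chosen.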
Lowering both Vdd and Vth, however, raises three problems. An
exponential increase in subthreshold leakage current due to the Vth reduction
not only shortens battery life in portable equipment but also disables IDDQ
testing. For these reasons it is very difficult to lower Vth below 0.3 volts. In
addition, the significant delay increase due to Vth variation at a low Vdd
degrades the worst-case circuit speed; moreover, it is difficult to reduce the
Vth variation by means of process and device refinement.
There are two approaches to solving these problems. Conventional
power-down schemes, either on a board or in a chip, can solve the battery life
problem. The other approach is to control Vth through substrate bias, which
can solve all three problems.

A Variable Threshold-voltage CMOS (VTCMOS) technology [6][7][8]
controls Vth by means of substrate bias control, as depicted in Figure 4.2.
The measured chip leakage current of an MPEG-4 chip fabricated in
VTCMOS technology is plotted in Figure 4.3. VTCMOS technology sets
the leakage current below 10mA in active mode and far lower in standby
mode, independently of process and temperature. The analytical model,
device design, and scaling scenario for VTCMOS technology are found
in [9].

Recently, the range of body bias has been extended from reverse to
forward. A forward substrate bias is used during active operation in order to
lower Vth for high-speed operation, and a zero substrate bias is used during
standby mode in order to raise Vth for low leakage. The substrate biasing
technique has begun to be applied to high-end products such as
microprocessors and communications chips for low-power, high-speed
operation [10][11].
An embedded DC-DC converter can vary the power supply voltage. If
both Vdd and the clock frequency are dynamically varied in response to
computational load demands, the energy per operation can be reduced during
periods of low computational load, while retaining peak throughput when
required. This strategy, called dynamic voltage scaling (DVS), was first
applied to a MIPS-compatible RISC core in 1998 [7]. The measured
performance in MIPS/W was improved by a factor of more than two
compared with that of a conventional design. In 2000, a DVS processor with
an ARM8 core was reported [12].

Operating systems for voltage scheduling have also been extensively
investigated [13][14]. The power efficiency of the embedded DC-DC
converter has been improved to 95% [15].
To probe further, [15][17] are helpful references.
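The DVS energy saving can be sketched with the alpha-power delay model: for a requested clock frequency, pick the lowest Vdd that still meets the cycle time, and let the energy per operation scale as C·Vdd². All constants below are assumptions, not measured values from the cited processors:

```python
# Illustrative DVS sketch (all constants are assumptions).
VT, ALPHA, VMAX = 0.5, 1.4, 3.3
F_MAX = 100e6          # assumed clock frequency achievable at VMAX

def speed(v):
    """Relative achievable frequency, proportional to (V - VT)^alpha / V."""
    return (v - VT) ** ALPHA / v

def min_vdd(f):
    """Lowest Vdd achieving frequency f, by bisection."""
    target = speed(VMAX) * f / F_MAX
    lo, hi = VT + 1e-3, VMAX
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if speed(mid) < target:
            lo = mid    # too slow at this voltage: raise Vdd
        else:
            hi = mid
    return hi

def energy_per_op(f, c=100e-12):
    return c * min_vdd(f) ** 2

ratio = energy_per_op(50e6) / energy_per_op(100e6)
print(f"energy/op at half throughput ~ {ratio:.2f}x of full throughput")
```

Because Vdd can drop much faster than the frequency demand, running at half throughput costs far less than half the energy per operation, which is the source of the MIPS/W gains quoted above.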

4.2.2 Dual Vdd

There are three ways to save power dissipation while maintaining the
maximum operating frequency by utilizing surplus timing in non-critical
paths: 1) employing multiple power supplies to lower the supply voltage of
non-critical circuits, 2) employing multiple threshold voltages to reduce
leakage current, and 3) employing multiple transistor widths to reduce
circuit capacitance. Clustered voltage scaling employing two power supplies
is discussed first.
The lower supply voltage, VddL, should be chosen so as to minimize the
power dissipation of the circuits. A theory for determining the optimal VddL
is described in [18]. According to the theory, the power reduction ratio R can
be calculated as a function of VddL when p(t) is provided, in which p(t)
represents the normalized number of paths whose delay is t when Vdd =
VddH. The power ratio R is calculated for five artificial examples of p(t) in
Figure 4.4. Interestingly, R becomes minimum at nearly the same VddL for
all the examples, even though the minimum value of R depends on p(t). This
means that VddL can always be set at around the same fraction of VddH to
minimize the power dissipation. In order to verify this theory, a discrete
cosine transform (DCT) block in an

MPEG-4 video codec is designed by using an EDA tool for the clustered
voltage scaling [19] at various and the power dissipation is monitored.
As shown in Figure 4.5, the experimental result shows a good agreement
with the theory when p(t) of lambda-shape is assumed. Power dissipation is
reduced by about 40%.
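The theory can be reproduced in miniature. The sketch below sweeps VddL for clustered voltage scaling over a lambda-shaped (triangular) path-delay distribution, using an alpha-power delay model; VddH, VT, and α are assumed values, and the capacitance of a path is taken proportional to its delay, as stated for the model in Section 4.2.3.1:

```python
import numpy as np

VH, VT, ALPHA = 3.3, 0.5, 1.4    # assumed VddH, threshold voltage, alpha

def delay_factor(vl):
    """Gate-delay degradation d = tpd(VddL)/tpd(VddH), alpha-power law."""
    return (vl / (vl - VT) ** ALPHA) / (VH / (VH - VT) ** ALPHA)

def power_ratio(vl, t, p):
    """R for clustered voltage scaling over the path-delay histogram p(t)."""
    d = delay_factor(vl)
    # Portion of each path (delay t at VddH, cycle time = 1) that can run at
    # VddL without violating timing; the whole path qualifies if t <= 1/d.
    t_low = np.where(t <= 1.0 / d, t, (1.0 - t) / (d - 1.0))
    frac = (p * t_low).sum() / (p * t).sum()   # capacitance taken ~ delay
    return 1.0 - frac * (1.0 - (vl / VH) ** 2)

# Lambda-shaped (triangular) normalized path-delay distribution p(t).
t = np.linspace(0.01, 1.0, 500)
p = np.where(t < 0.5, t / 0.5, (1.0 - t) / 0.5)

vls = np.linspace(1.2, 3.2, 200)
ratios = np.array([power_ratio(v, t, p) for v in vls])
best = vls[ratios.argmin()]
print(f"optimal VddL ~ {best:.2f} V ({best/VH:.2f} of VddH), R ~ {ratios.min():.2f}")
```

With these assumptions a clear interior minimum appears well below VddH, with a power saving of roughly the magnitude reported for the DCT experiment; the exact optimum shifts with p(t), VT, and α.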
Two MPEG-4 video codec chips were developed using the two approaches:
controlling Vdd and Vth, and employing two Vdd's [20]. The power
dissipation of the chips was simulated and measured. By optimizing Vdd and
Vth, the power supply voltage can be lowered from 3.3V to 2.5V, so that the
power dissipation is reduced by 43% in all the circuits. By employing one
more supply of 1.75V for non-critical circuits, the power dissipation is
further reduced by 25%, for a total reduction of 55% compared to the
conventional design at 3.3V.

4.2.3 Multiple Vdd, Vth, and W

In the past, a single Vdd, a single Vth, and a single W were employed in
low-power design. Recently, dual Vdd, dual Vth, and several W's are often
used in low-power libraries. In the future, will many more values of Vdd,
Vth, and W be used for low-power design? How many parameters will be
required for what degree of power reduction? How will the parameters be
optimized? Which of the three approaches will be most effective?
Theoretical models have been developed to answer these questions and to
derive knowledge for future design [21]. For simplicity, the theoretical
models assume non-crossing parallel signal paths that are composed of
concatenated gates.

4.2.3.1 Multiple Power Supplies

In multiple power supplies (VDD1 > VDD2 > ... > VDDn), power dissipation is given by

where Ci is the total capacitance of the circuits and interconnections that will
operate under VDDi, and f is the operating frequency. The ratio RP of power
dissipation with the multiple power supplies to that with a single power
supply is given by

As shown in a design example of a 64-bit integer datapath for a CPU
core [21], delay and capacitance are mostly proportional. Therefore, Ci/C is
calculated by

where p(t) is a normalized path-delay distribution function, and ti is the total
normalized delay of the circuits that will operate under VDDi. Consider a path whose total
delay t is between two boundary values, where the path delay is normalized so that it
equals the cycle time (=1) when all the circuits operate under VDD1. Among many
combinations of power supplies that make the total delay of the path equal to
the cycle time, power dissipation is minimized when the appropriate VDDi is applied.
Accordingly, Ci is given by

where ti is given by

VT is the threshold voltage, and α is the velocity saturation index. From equations


(4.4)-(4.7), RP can be calculated for a given p(t), VT, and α.
Calculation results for dual power supplies show good agreement
with the simulation result in reference [18]. For triple power supplies,
a computed 3-D graph and its contour lines are depicted in Figure
4.6.
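The display equations here did not survive reproduction. Based on the definitions in the surrounding text, a plausible reconstruction (the assignment to equation numbers (4.4)-(4.7) is inferred, not certain) is:

```latex
P \;=\; \sum_{i=1}^{n} C_i\, V_{DDi}^{2}\, f,
\qquad
R_P \;=\; \frac{\sum_{i=1}^{n} C_i\, V_{DDi}^{2}}
               {\Bigl(\sum_{i=1}^{n} C_i\Bigr)\, V_{DD1}^{2}},
\qquad
\frac{C_i}{C} \;=\; \int_{t_i}^{t_{i-1}} p(t)\, dt
```

with the bin boundaries $t_i$ following from the alpha-power delay model $d(V) \propto V/(V-V_T)^{\alpha}$, which is where $V_T$ and $\alpha$ enter the calculation.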
In Figure 4.7 the calculated optimum VDD's and the optimized power
dissipation are plotted. Taking the results of after-layout static timing
analysis into consideration, a lambda-shaped p(t) is adopted here.
A rough rule of thumb for the optimum VDD's is derived:

This rule of thumb gives almost optimum VDD's, under which power is
reduced to within 1% of the precise minimum.
It is also understood from Figure 4.7 that the more VDD's, the less power, but
the effect saturates. The power reduction effect will also be
diminished as the power supply voltage is scaled down and as VTH becomes higher.
The following equation gives a reasonably accurate approximation:

4.2.3.2 Multiple Threshold Voltages

In multiple threshold voltages (VTH1 < VTH2 < ... < VTHn), chip leakage current is
given by

where Wi is the total gate width of the pMOS and nMOS transistors whose threshold
voltage is VTHi and whose sources are connected to VDD or VSS. The ratio of chip leakage
current with multiple threshold voltages to that with a single threshold voltage is
given by

In a typical design, where buffer sizes and the number of repeaters are
optimally chosen, delay and transistor width are mostly proportional, and
Wi is calculated by

The chip leakage current ratio can be computed in the same way as in Section 4.2.3.1.
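Since subthreshold leakage scales exponentially as $10^{-V_{TH}/S}$ (with $S$ the subthreshold swing), the missing leakage expressions plausibly take the form below; this is a reconstruction, and $I_0$, the leakage per unit width at $V_{TH}=0$, is an assumed normalization not present in the text:

```latex
I_{leak} \;=\; \sum_{i=1}^{n} W_i\, I_0\, 10^{-V_{THi}/S},
\qquad
R_I \;=\; \frac{\sum_{i=1}^{n} W_i\, 10^{-V_{THi}/S}}
               {\Bigl(\sum_{i=1}^{n} W_i\Bigr)\, 10^{-V_{TH1}/S}}
```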

A computed 3-D graph for triple threshold voltages (VTH1 < VTH2 < VTH3) and its
contour lines are depicted in Figure 4.8. In Figure 4.9 the calculated optimum
VTH's and the optimized chip leakage current are plotted. A rough rule of
thumb for the optimum VTH's is derived:

This rule of thumb provides nearly optimum VTH's, as shown in Figure 4.9.

It is also understood from Figure 4.9 that the more VTH's, the less leakage
current, but the effect saturates. The leakage reduction effect will
also be diminished as the power supply voltage is scaled down and as VTH becomes
higher. At the optimum, the percentages of the total transistor width assigned to
the respective threshold voltages are 0.4%, 3%, 11%, and 85%.
For those designs, such as high-end microprocessors, where power
dissipation due to leakage current makes up a fairly large amount of the power
dissipation due to the low VTH, reducing the leakage current by more than one order
of magnitude is very effective.
For other designs, where the leakage current is already suppressed to a fairly
small amount, the leakage reduction can instead be converted into a reduction
of AC power: VTH and, accordingly, VDD are lowered to
the point where the chip leakage current is the same as that in the single-VTH
design. As a result, AC power is reduced by about 20%.

4.2.3.3 Multiple Transistor Width

When multiple transistor widths (W1 > W2 > ... > Wn) are employed, power
dissipation is given by

where Cti is the total gate and diffusion capacitance of the transistors whose
channel width will be scaled to Wi, and Cint is the total interconnection
capacitance. The ratio of power dissipation when using multiple
transistor widths to that when using a single transistor width is given by

where m is the ratio of the interconnection capacitance to the transistor
capacitance. From the design example presented in Figure 4.1, the transistor
and interconnection capacitances are calculated versus delay as shown in Figure 4.8 and Figure 4.9,
respectively. Since delay and transistor capacitance are mostly proportional,
an m of 1.67 is used in this study. Cti is calculated by

The power dissipation ratio can be computed in the same way as in
Section 4.2.3.1.

A computed 3-D graph for triple transistor widths (W1 > W2 > W3) and its
contour lines are depicted in Figure 4.10. A rough rule of thumb for the
optimum W's is derived:

Circuit capacitance is reduced by 40%, which reduces the total capacitance by 15%.

4.2.3.4 Summary

To compare the three approaches: power reduction is about 50% by multiple
VDD's, about 20-30% (depending on the device) by multiple VTH's,
and about 15% by transistor sizing. However, the power penalty in multiple VDD's
induced by level converters and layout inefficiency was not included in the
theory and should be taken into account. Furthermore, the power reduction
effect will be diminished in multiple VDD's and multiple VTH's as the power
supply voltage is scaled down and as VTH becomes higher, while the effect of multiple W's
stays constant. However, even that effect may be lessened as the interconnection
capacitance dominates in the future. The shorter the cycle time, the smaller
the chance that more than two approaches can be applied for a multiplied
effect. The most effective approach should therefore be applied first. To answer the
questions from the beginning of the section, no reason has been found to use
more than three types of VDD's, VTH's, and W's in the future.

4.2.4 Low Voltage SRAM

The threshold voltage of a transistor cannot be as low in an SRAM as in
a logic circuit, for three reasons: increased bit-line leakage current,
degraded cell-data stability, and increased static power dissipation. Among
these limitations, the bit-line leakage problem is becoming the most crucial.
The measured cell current and bit-line leakage current in the worst-case
data pattern of an SRAM with 256 rows, fabricated in a CMOS
technology, are depicted in Figure 4.11. The rapid increase in the bit-line leakage at low VTH
degrades operation speed and finally causes operation errors. If the leakage is to be kept
sufficiently low, VTH should be higher than 0.35V, considering a ±0.1V
fluctuation. As illustrated in Figure 4.12, VTH has been about 23% of VDD in
SRAMs and around 15% in logic circuits over recent high-speed device
generations. However, it is predicted that in a future technology
the SRAM VTH cannot be scaled down as before, because of the
excessive bit-line leakage, which becomes three times as large in the worst
case.
Several schemes are proposed to solve the problem [22][23][24]. A


negative word-line scheme [22] may suffer from gate-induced drain leakage.
A dynamic leakage cut-off scheme [23] may degrade operation speed since
additional time is required for applying reverse substrate bias. A bit-line
leakage compensation (BLC) scheme [24] does not suffer from these
penalties.
A circuit for the BLC scheme is illustrated in Figure 4.13. Bit-line
leakage is detected by P2 (and P2B) in a pre-charge cycle, and the same
amount of current is injected to the bit-line for compensation by P4 (and
P4B) during a read/write cycle. P2 and P4 are symmetric for current mirror
operation. A control signal “/cal”passes and stores the detected current as
potential in CAP7 (and CAP7B), and a control signal “/comp” enables the
current injection. Area penalty caused by the additional circuit is only 3% in
a 1k x 8b SRAM.
Waveforms of the bit-lines simulated by SPICE for this
technology at 105°C are depicted in Figure 4.14. Figures 4.14(a) and (b)
show waveforms in the conventional circuit with no leakage in both bit-
lines, and leakage only in “BL,” respectively. Due to the leakage,
potential difference between the two bit-lines is reduced in Figure 4.14(b).
On the other hand, in Figure 4.14(c) where the BLC scheme is employed,
the potential difference is kept almost the same as that in Figure 4.14(a), in
spite of the leakage. Figure 4.14(d) shows waveforms of the control
signals. It should be stated that after completion of the leakage detection, the
“/cal” signal should be reset so that the compensation as well as equalization
of the bit-lines can start immediately for the succeeding read/write operation.

Even with the additional operation for leakage detection in the pre-charge
cycle, write-recovery can be completed with little speed penalty as shown in
Figure 4.14(c), since P3 assists P1 with the pre-charge operation while
“/comp” is low.
Capacitance associated with CAP7 should be carefully designed,
because a capacitance that is too small will cause charge sharing and reduce
the injection current due to coupling noise from the source of P4, whereas a
capacitance that is too large will increase the detection time and hence the pre-
charge cycle time.
The simulated delay time at which the potential difference between the two
bit-lines reaches 100mV is plotted in Figure 4.15. If the budget for the bit-
line delay is 0.5ns, the bit-line leakage must be kept small without the
compensation, whereas it can be substantially larger with the proposed
compensation scheme. This advantage corresponds to a 0.1V reduction in VTH. It
is also found from Figure 4.12 that VTH can be scaled down in this
technology as it was previously, when the BLC scheme is employed.

4.3 LOW SWITCHING-ACTIVITY TECHNIQUES

The place on a chip where the switching activity is highest is circuitry
where the clock comes in and out. A flip-flop consumes considerable power through
clock toggling. Typically, one-fourth to one-half of the total power
dissipation of a chip is consumed by flip-flops.

In the conventional flip-flop shown in Figure 4.16, 12 transistors out of 24
are charged and discharged in every clock cycle. In the clock-on-
demand flip-flop (COD-F/F), depicted in Figure 4.17 [25], the internal clock (CKI)
is activated only when the input data (D) will change the output (Q), which
is the ultimate form of hardware clock gating. This type of flip-flop is called a
conditional flip-flop. Another circuit implementation is found in [26].

Figure 4.18 shows waveforms of COD-F/F simulated by SPICE. When


the D input is the same level as the Q output, CKI remains at the L-level
even when CLK rises. When the D input differs from the Q output, CKI will
still stay at the L-level until the next CLK input and will rise to the H-level
along with the rise of CLK. At this point the D input is captured by COD-
F/F, and the Q output changes accordingly. Immediately, CKI returns to the
L-level because the Q output is now on the same level as the D input. In this
way, COD-F/F generates the self-aligned pulsed clock internally and
operates reliably. When CKI is a short pulse, a latch circuit can operate like
an edge-triggered flip-flop. Therefore, COD-F/F consists of the clock-gating
circuit and the latch circuit, which reduces area and power penalties. Since
the internal clock is generated and distributed only in a cell, and the pulse
width is self-aligned by the Q transition, no distortion problem occurs.
The dependence of the power dissipation on the data switching probability, pt,
is shown in Figure 4.19. For example, at a pt of 0.3, the COD-F/F consumes 50%
less power than the conventional flip-flop. The lower the pt, the lower the
power dissipation. The power penalty due to the clock-gating circuit is almost
cancelled out by the circuit reduction from a flip-flop to a latch. For pt less
than 0.95, the COD-F/F dissipates less power than the conventional flip-flop.
Since pt in logic circuits is around 0.3 on average, and 0.5 at most, the
conventional flip-flop should always be replaced by the COD-F/F as long as the
delay penalty can be accepted. In Table 4.1, the characteristics of the COD-F/F are
presented.
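As a rough illustration of the pt dependence, a first-order energy model can reproduce the two numbers quoted above. The per-event energies below are hypothetical values chosen for that purpose, not measurements from [25]:

```python
# Toy first-order model of flip-flop power vs. data switching probability pt.
# The energy constants are hypothetical, calibrated to the figures in the text
# (about 50% savings at pt = 0.3, break-even near pt = 0.95).

E_CLK_CONV = 1.0   # conventional F/F: clock-toggling energy, spent every cycle
E_DATA     = 1.0   # conventional F/F: extra energy when the output changes
E_CTRL_COD = 0.05  # COD-F/F: clock-gating control overhead, spent every cycle
E_EVT_COD  = 2.0   # COD-F/F: internal clock pulse + latch energy per output change

def p_conventional(pt):
    return E_CLK_CONV + pt * E_DATA

def p_cod(pt):
    return E_CTRL_COD + pt * E_EVT_COD

for pt in (0.1, 0.3, 0.5, 0.95):
    saving = 1.0 - p_cod(pt) / p_conventional(pt)
    print(f"pt={pt:.2f}  saving={saving:5.1%}")
```

With these toy values the model gives exactly 50% savings at pt = 0.3 and break-even at pt = 0.95; real savings depend on the actual cell energies.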

In the conventional design, the best single flip-flop is employed


uniformly in all the circuits in a chip. A different design strategy, namely
F/F blending, has been proposed [25]. Different types of flip-flops with
different merits are blended to trade off power, delay, and the area of a chip.
Low Switching-Activity Techniques 113

Blending a high-speed flip-flop can compensate for the speed penalty of


COD-F/F.
A latch-based design is more aggressive in terms of speed than a flip-flop-
based design because cycle time can be borrowed. However, it is more
difficult to verify timing in a latch-based design, and hence the design
turn-around time is longer. For this reason, flip-flop-based design is
popular except for high-end, high-volume products. A flip-flop with
negative setup time for cycle-time borrowing offers both circuit speed and
ease of design.

Four types of negative setup-time flip-flops (NS-F/Fs) are designed for


borrowing a few nanoseconds from the succeeding cycle. As shown in
Figure 4.20, a delay element is inserted to shift the internal-clock (CKI).

Since the clock signal of the conventional flip-flop is simply delayed locally,

a large negative setup time can be obtained exactly. The characteristics of the NS-
F/Fs are presented in Table 4.1.
Various flip-flops with different merits and demerits are available now.
In order to take advantage of the merits and compensate for the demerits, the
key is a blending methodology, or putting the right flip-flop in the right
place.
Three types of discrete cosine transform (DCT) blocks for an MPEG-4
video coder are designed by F/F blending. First, a conventional DCT is
designed as a reference by a conventional design technique. Only the
conventional flip-flop depicted in Figure 4.16 is registered in a cell library as
a flip-flop. Logic synthesis is performed with a commercial logic synthesis
tool, targeting the highest operating frequency in the minimum layout
area. A maximum operating frequency of 80MHz is obtained.
Then a low-power DCT (LP-DCT) is designed, using two design steps.
First, only the COD-F/F is registered in the cell library as a flip-flop, and logic is
synthesized for an operating frequency somewhat slower than 80MHz, taking
the delay penalty of the COD-F/F into account. In this stage the COD-F/F is
mapped to all the flip-flops to minimize chip power dissipation. Secondly,
the conventional flip-flop is included in the library, and logic synthesis
is performed for 80MHz operation. The conventional flip-flop replaces
some of the COD-F/Fs to meet timing constraints. As a result, 553 COD-
F/Fs are blended in the chip, as shown in Table 4.2. The blending ratio is
about 80/20.

Next a high-speed DCT (HS-DCT) is designed, including the 4 types of


NS-F/Fs in the cell library together with the conventional flip-flop. 100MHz
operation is obtained, which is a 25% speed improvement. The number of
flip-flops used in the design is found in Table 4.2.
Path delay is analyzed by a static timing analyzer. The path delay
distributions are depicted in Figure 4.21. Timing slack is remarkably
reduced by the cycle-time borrowing. The more cycle time is borrowed, the
more chances there are of borrowing cycle time from the succeeding cycle in the
next pipeline stage. In this way, the cycle-time borrowing propagates,
resulting in larger area and power penalties due to the NS-F/Fs. Since cell
size grows as the negative setup time of an NS-F/F increases, the logic
synthesis tool automatically maps the smallest sufficient NS-F/F after area
optimization, minimizing the penalties while satisfying the timing
requirements. The fact that NS-F/F#4 is not used exclusively supports
the assumption that the flip-flops are appropriately mapped to minimize
the penalties. Since the clock edge is locally shifted by various amounts, the
hold-time violations should be given careful attention. A logic synthesis tool
can fix these problems automatically. The area penalty is on the order of
several percentage points.

The three DCT blocks are fabricated in a CMOS double-well,


triple-metal technology. Measured results are shown in Table 4.3.

As for the low-power DCT, power dissipation is reduced by 24% for the random
picture and 51% for the still picture while maintaining the 80MHz maximum
operating frequency. The area is 12% larger as a result of gate sizing
that compensates for the delay increase of the COD-F/F. The power breakdown
analyzed by simulation is shown in Figure 4.22. The “others” category includes
combinational logic, SRAMs, and clock trees. The power difference
between the Conv-DCT and the LP-DCT comes from the difference in the power
dissipation of the flip-flops. The HS-DCT operates at a maximum frequency
25% higher than that of the conventional DCT. Its area penalty is due to
the increased cell size of the NS-F/Fs and the gate sizing needed to meet tight timing
constraints. Its power penalty is mainly due to the delay circuit in the NS-F/F.
F/F blending explores better tradeoffs among power, delay, and area.
Since neither the RTL design nor the timing constraints have to be modified,
designers can use the technique without detailed knowledge of the chip. The
technique also shortens design turn-around time. There is no need to change the
device technology and, hence, no increase in process cost.

4.4 LOW CAPACITANCE TECHNOLOGIES

Accessing external memory consumes considerable power due to the
large parasitic capacitance along the external signal lines. An MPEG-4
videophone LSI with 16Mbit of embedded DRAM is fabricated in an
embedded DRAM process [27]. The chip micrograph is shown in Figure
4.23. As depicted in Figure 4.24, the measured power dissipation is about one
third of the total power of a multi-chip solution.

4.5 SUMMARY

The power wall is a clear and present roadblock for the semiconductor industry.

Since there is no new energy-efficient technology on the horizon, designers
should take on low-power CMOS design as the current challenge. The general
guidelines for power reduction are to lower the supply voltage, the switching
probability, and the capacitance. CMOS power dissipation can be reduced to
0.8x-0.5x by lowering the supply voltage, 0.8x-0.5x by lowering the switching
activity, and 0.8x-0.3x by lowering the capacitance. In total, a power reduction to
0.5x-0.1x can be achieved by design efforts. Supposing the natural power
increase due to device scaling doubles power every 6.5 years, this power reduction
corresponds to an extension of the CMOS life span by about 6-20 years. Even
though design challenges should be pursued further, such as better algorithms
that lower the clock frequency, a new set of technologies is required as a long-term
solution to the power problems, at least for mass production in the 2010s. It
is hoped that innovations will occur as they did before. To this end, it is
important to understand the impact of the evolution of integrated circuits on our
society and to promote research efforts worldwide.

REFERENCES
[1] T.Kuroda, and T. Sakurai, “Overview of low-power ULSI circuit techniques,” IEICE
Trans. On Electronics, vol. E78-C, no. 4, pp. 334-344, April 1995.
[2] T. Kuroda, “CMOS design challenges to power wall,” in Proc. of International
Microprocesses and Nanotechnology Conference, pp. 6-7, Nov. 2001.
[3] T. Kuroda, “Low power CMOS design challenges,” IEICE Trans. Electronics, vol. E84-
C, no. 8, pp.1021-1028, Aug. 2001.
[4] A. Chandrakasan, S. Sheng, and R. Brodersen, “Low-power CMOS digital design,” IEEE
Journal of Solid-State Circuits, vol. 27, no. 4, pp. 473-484, Apr. 1992.
[5] K. Nose, and T. Sakurai, “Optimisation of VDD and VTH for Low-Power and High-
Speed Applications,” in Proc. of ASPDAC, pp. 469-474, Jan. 2000.

[6] T. Kuroda, T. Fujita, S. Mita, T. Nagamatu, S. Yoshioka, K. Suzuki, F. Sano, M.


Norishima, M. Murota, M. Kako, M. Kinugawa, M. Kakumu, and T. Sakurai, “A 0.9V
150MHz 10mW 4mm2 2-D discrete cosine transform core processor with variable-
threshold-voltage scheme,” IEEE Journal of Solid-State Circuits, vol. 31, no. 11, pp.
1770-1779, Nov. 1996.
[7] T. Kuroda, K. Suzuki, S. Mita, T. Fujita, F. Yamane, F. Sano, A. Chiba, Y. Watanabe, K.
Matsuda, T. Maeda, T. Sakurai, and T. Furuyama, “Variable supply-voltage scheme for
low-power high-speed CMOS digital design” IEEE Journal of Solid-State Circuits, vol.
33, no. 3, pp. 454-462, Mar. 1998.
[8] T. Kuroda, T. Fujita, F. Hatori, and T. Sakurai, “Variable threshold-voltage CMOS
technology,” IEICE Trans. on Electronics, vol. E83-C, no. 11, pp. 1705-1715, Nov.
2000.
[9] H. Im, T. Inukai, H. Gomyo, T. Hiramoto, and T. Sakurai, “VTCMOS characteristics and
its optimum conditions predicted by a compact analytical model,” in ISLPED ’01 Dig.
Tech. Papers, pp. 123-128, Aug. 2001.
[10] S. Narendra, M. Haycock, V. Govindarajulu, V. Erraguntla, H. Wilson, S. Vangal, A.
Pangal, E. Seligman, R. Nair, A. Keshavarzi, B. Bloechel, G. Dermer, R. Mooney, N.
Borkar, S. Borkar, and V. De, “1.1V 1GHz communications router with on-chip body
bias in 150nm CMOS,” in ISSCC’02 Dig. Tech. Papers, pp. 270-271, Feb. 2002.
[11] S. Vangal, N. Borkar, E. Seligman, V. Govindarajulu, V. Erraguntla, H. Wilson, A.
Pangal, V. Veeramachaneni, M. Anders, J. Tschanz, Y. Ye, D. Somasekhar, B. Bloechel,
G. Dermer, R. Krishnamurthy, S. Narendra, M. Stan, S. Thompson, V. De, and S.
Borkar, “A 5GHz 32b Integer-execution core in 130nm Dual-VT CMOS,” in ISSCC’ 02
Dig. Tech. Papers, pp. 412-413, Feb. 2002.
[12] T. Burd, T. Pering, A. Stratakos, and R. Brodersen, “A dynamic voltage scaled
microprocessor system,” IEEE Journal of Solid-State Circuits, vol. 35, pp. 1571-1580,
Nov. 2000.
[13] T. Ishihara, and H. Yasuura, “Voltage scheduling problem for dynamically variable
voltage processors,” in ISLPED ’98 Dig. Tech. Papers, pp. 197-202, Aug. 1998.
[14] S. Lee, and T. Sakurai, “Run-Time Voltage Hopping for Low-Power Real-Time
Systems,” in Proc. of DAC, pp. 806-809, Jun. 2000.
[15] F. Ichiba, K, Suzuki, S. Mita, T. Kuroda, and T. Furuyama, “Variable supply-voltage
scheme with 95%-efficiency dc-dc converter for MPEG-4 Codec,” in ISLPED Dig. Tech.
Papers, pp. 54-58, Aug. 1999.
[16] A. Chandrakasan, W. Bowhill, and F. Fox, Eds., Design of High-Performance
Microprocessor Circuits, Chapter 4, IEEE Press, Piscataway, NJ, 2000.
[17] V. Oklobdzija, Ed., The Computer Engineering Handbook, Section IV, CRC Press, New York,
2002.
[18] T. Kuroda, and M. Hamada, “Low-power CMOS digital design with dual embedded
adaptive power supplies,” IEEE Journal of Solid-State Circuits, vol. 35, no. 4, pp.652-
655, April 2000.
[19] K. Usami, M. Igarashi, T. Ishikawa, M. Kanazawa, M. Takahashi, M. Hamada, H.
Arakida, T. Terazawa, and T. Kuroda, “Design methodology of ultra low-power mpeg4
codec core exploiting voltage scaling techniques,” in Proc. of DAC, pp. 483-488, June
1998.
[20] M. Takahashi, M. Hamada, T. Nishikawa, H. Arakida, T. Fujita, F. Hatori, S. Mita, K.
Suzuki, A. Chiba, T. Terasawa, F. Sano, Y. Watanabe, K. Usami, M. Igarashi, T.
Ishikawa, M. Kanazawa, T. Kuroda, and T. Furuyama, “A 60mW MPEG4 video codec
using clustered voltage scaling with variable supply-voltage scheme,” IEEE Journal of
Solid-State Circuits, vol. 33, no. 11, pp. 1772-1780, Nov. 1998.
[21] M. Hamada, Y. Ootaguro, and T. Kuroda, “Utilizing surplus timing for power
reduction,” in Proc. of CICC’2001, pp. 89-92, May 2001.
[22] H. Tanaka et al., “A Precise on-chip voltage generator for a gigascale dram with a
negative word-line scheme,” IEEE Journal of Solid-State Circuits, vol.34, pp. 1084-1090,
Aug. 1999.
[23] H. Kawaguchi et al., “Dynamic leakage cut-off scheme for low-voltage SRAM’s,” in
Symp. on VLSI Circuits Dig. Tech. Papers, pp. 140-141, June 1998.
[24] K. Agawa, H. Hara, T. Takayanagi, and T. Kuroda, “A bit-line leakage compensation
scheme for low-voltage SRAM’s,” IEEE Journal of Solid-State Circuits, vol. 36, no. 5,
May 2001.
[25] M. Hamada, T. Terazawa, T. Higashi, S. Kitabayashi, S. Mita, Y. Watanabe, M. Ashino,
H. Hara, and T. Kuroda, “Flip-flop selection technique for power-delay trade-off,” in
ISSCC’99 Dig. Tech. Papers, pp. 270-271, Feb. 1999.
[26] B. Kong, S. Kim, and Y. Jun, “Conditional-capture flip-flop technique for statistical
power reduction,” in ISSCC ’00 Dig. Tech. Papers, pp. 290-291, Feb. 2000.
[27] T. Nishikawa, M. Takahashi, M. Hamada, T. Takayanagi, H. Arakida, N. Machida, H.
Yamamoto, T. Fujiyoshi, Y. Maisumoto, O. Yamagishi, T. Samata, A. Asano, T.
Terazawa, K. Ohmori, J. Shirakura, Y. Watanabe, H. Nakamura, S. Minami, and T.
Kuroda, “A 60MHz 240mW MPEG-4 video-phone LSI with 16Mb embedded DRAM,”
in ISSCC Dig. Tech. Papers, pp. 230-231, Feb. 2000.
Chapter 5
Low Voltage Analog Design

K. Uyttenhove and M. Steyaert


Katholieke Universiteit Leuven, ESAT-MICAS

Abstract: The current trend towards low-voltage, low-power design is mainly driven by
two important aspects: the growing demand for long-life autonomous portable
equipment (cellular phones, PDAs, etc.) and the technological limitations of
high-performance VLSI systems (heat dissipation). These two forces are now
combined as portable equipment grows to encompass high-throughput
intensive products such as portable computers and cellular phones. The most
efficient way to reduce the power consumption of digital circuits is to reduce
the supply voltage since the average power consumption of CMOS digital
circuits is proportional to the square of the supply voltage. The resulting
performance loss can be overcome for standard CMOS technologies by
introducing more parallelism and/or modifying the process and optimizing it
for low-voltage operation. The rules for analog circuits are quite different than
those applied to digital circuits. It will be shown that the downscaling of the
supply voltage does not automatically decrease the analog power consumption.
After a general introduction on the limits to low power for analog circuits, an
extensive part of this chapter will deal with the impact of reduced supply
voltage on the power consumption of high-speed analog to digital converters
(ADC). It will be shown that power consumption will not decrease and, even
worse, will increase in future submicron technologies. This trend will be
shown and solutions will be offered at the end of this chapter. A comparison
with the power consumption of published high-speed analog to digital
converters will also be presented.

Key words: Low voltage operation, analog circuits, low power analog design, ADC,
matching, voltage scaling.
5.1 INTRODUCTION

5.1.1 Fundamental Limits to Low Power Consumption

Power is consumed in analog signal processing circuits to maintain the
signal energy above the thermal noise floor, in order to achieve the required
signal-to-noise ratio (SNR) or dynamic range (DR). A representative figure
of merit for different signal processing systems is the power consumed to
realize a single pole. It can be evaluated by considering the basic integrator
shown in Figure 5.1, assuming that all the current pulled from the power
supply is used to charge the integrating capacitance (i.e., the
analog building block uses the available current with an efficiency of 100%;
otherwise, efficiency factors that depend on the mode of operation
(Class A, Class B, Class C, Class E, ...) must be used in the following analysis).
The minimum power consumed from the supply voltage that is
necessary to create a sinusoidal voltage V(t) across a capacitor C, with a
peak-to-peak amplitude Vpp and a frequency f, can be expressed as [1]:

where k is the Boltzmann constant and T is the absolute temperature.

According to this equation, the minimum power consumption of analog
circuits at a given temperature is basically set by the required DR and the
frequency of operation (or the required bandwidth). Since the minimum
power consumption is also proportional to the ratio between the supply
voltage and the signal amplitude, power-efficient analog circuits should be
designed to maximize the voltage swing. This minimum necessary power
consumption is nowadays never reached (see Figure 5.2, [1]) because of
multiple effects that increase the power consumption (some of them are
pointed out in the next section on practical limits).
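The expression from [1] that was lost above is commonly quoted as follows; this is a reconstruction assuming 100% current efficiency: the supply must deliver roughly $P = f\,C\,V_{pp}\,V_{DD}$ for a sinusoid of peak-to-peak amplitude $V_{pp}$ across $C$, and with $\mathrm{SNR} = (V_{pp}^{2}/8)/(kT/C)$ this gives:

```latex
P_{min} \;=\; 8\,kT\, f\;\mathrm{SNR}\;\frac{V_{DD}}{V_{pp}}
```

For a rail-to-rail swing ($V_{pp} = V_{DD}$) this reduces to $P_{min} = 8\,kT\,f\,\mathrm{SNR}$.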

5.1.2 Practical Limitations for Achieving the Minimum


Power Consumption

The limits discussed so far are fundamental, since they depend neither on
the technology nor on the choice of the supply voltage. However, a number of
obstacles or technological limitations stand in the way of approaching these
limits in practical circuits, and ways to reduce the effect of these various
limitations can be found at all levels of analog design, ranging from the device
to the system level:

Capacitors increase the power necessary to achieve a certain bandwidth.


They are only acceptable if their presence reduces the noise power (by
reducing the noise bandwidth). So, parasitic capacitances are very bad for
minimum power consumption.
Power is increased if the signal at any node corresponding to a functional
pole (a pole within the bandwidth) has a voltage amplitude smaller than the
supply voltage. Thus, care must be taken to amplify the signal as early as
possible to its maximum possible voltage value. Using current-mode
techniques with reduced voltage swings is therefore not a good approach to
reducing power, as long as the energy is supplied by a voltage source.
The presence of additional noise sources implies an increase in power
consumption. These include 1/f noise of the active devices and noise
coming from the power supply or generated on chip by other blocks of
the circuit.
When capacitive loads are used, the power supply current “I” necessary
to obtain a given bandwidth is inversely proportional to the gm/I ratio of
the active device. The small value of gm/I inherent to MOS transistors
operating in strong inversion may therefore cause an increase in power
consumption.
The need for precision (e.g., in high-speed flash ADCs, as explained
further on) leads to the use of larger dimensions for active and passive
components, with a resulting increase in parasitic capacitances and
power.
All switched capacitors must be clocked at a frequency higher than twice
the signal frequency. The power consumed by the clock driver itself may
be dominant in some applications.

5.1.3 Implications of Reduced Supply Voltage

The reason why deeper submicron technologies use lower supply voltages
will be shown in the second part of this chapter, which deals with the
influence of technology downscaling on the power consumption of high-
speed flash ADCs. In this section a general overview is given of the
implications of reducing the supply voltage.
According to the equation for the minimum analog power consumption,
reducing the analog supply voltage while preserving the same bandwidth and
DR has no fundamental effect on the minimum power consumption.
However, this absolute limit was obtained by neglecting the possible
limitation of the bandwidth B due to the limited transconductance gm of the active
device. The maximum value of B is proportional to gm/C. It can be shown
that by replacing the capacitor value C by its expression in terms of gm and B,
the following equation can be written:
Introduction 125

In most cases, scaling the supply voltage down by a factor K requires a
proportional reduction of the signal swing Vpp. Having the same SNR and
bandwidth is therefore only possible if the transconductance gm is increased
by a factor K². Two different cases can now be distinguished:

If the active device is a bipolar device, its transconductance can only be
increased by increasing the bias current with the same factor K², so the
power consumption increases with a factor K.
If the active device is a MOS device, the transconductance is proportional
to I/(Vgs−Vt), where Vgs−Vt is the gate overdrive voltage.
a) If Vgs−Vt scales down with the same factor K, then the power of
the circuit remains constant.
b) If Vgs−Vt does not scale down, then the power of the circuit
increases with a factor K.
The effect of scaling or not scaling the gate-overdrive voltage will be further
examined in the section on high-speed ADCs.
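The case analysis above can be summarized in a few lines of bookkeeping (a sketch under the chapter's simple gm models; the function and its name are ours, and the constants are illustrative only):

```python
def analog_power_factor(K: float, device: str, overdrive_scales: bool = False) -> float:
    """Factor by which analog power changes when supply and signal swing
    are divided by K at constant SNR and bandwidth.

    Swing/K forces C x K^2 (same SNR); B ~ gm/C then forces gm x K^2.
    bipolar: gm ~ I       -> I x K^2, P = (Vdd/K)*(K^2*I) = K*P
    MOS:     gm ~ I/Vov   -> I x K (if Vov scales by 1/K) or I x K^2 (Vov fixed)
    """
    if device == "bipolar":
        return K
    if device == "mos":
        return 1.0 if overdrive_scales else K
    raise ValueError("device must be 'bipolar' or 'mos'")

print(analog_power_factor(2.0, "bipolar"))        # 2.0: power doubles
print(analog_power_factor(2.0, "mos", True))      # 1.0: power constant
print(analog_power_factor(2.0, "mos", False))     # 2.0: power doubles
```

Halving the supply thus never reduces analog power in these cases; at best (MOS with a scaled overdrive) it leaves it unchanged.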

As a conclusion of the general introduction, one can state that
downscaling the power supply voltage in mixed-mode circuits will certainly
decrease the power consumption of the digital circuits, but not necessarily the
power consumption of the analog circuitry. Instead, with all the parasitic
effects, analog power consumption will probably increase in future low-voltage
deep-submicron technologies. Solutions to this problem can be found at the
various design abstraction levels and will be discussed at the end of this
chapter.
The next major part of this chapter discusses a more practical example:
the influence of technology downscaling on the power consumption of high-
speed ADCs. As a case example, a flash ADC architecture is used to
investigate the problems with the downscaling of technology. One important
difference between this part and the previous one is that the thermal
noise levels are much lower than the quantization noise levels in low-
resolution flash ADCs. But remember that, in the end, the noise level is the
ultimate limitation leading to the minimal power consumption of analog
building blocks that must achieve a certain speed and dynamic range.
126 Low Voltage Analog Design

5.2 SPEED-POWER-ACCURACY TRADE-OFF IN HIGH SPEED ADC'S

5.2.1 High-speed ADC Architecture

An A/D conversion algorithm describes the functional operation of the
ADC. The architecture of the ADC is the translation of this algorithm into
hardware.

The choice of architecture is strongly related to the system design. This
system design is ruled by the trade-off between the performance and
hardware cost of its building blocks. From the literature, high-speed,
low/medium-resolution ADC architectures can be roughly divided into three
groups: full flash, folding/interpolating, and pipelined architectures. Each of
these architectures has its own place in the resolution-bandwidth picture
shown in Figure 5.3. Flash-type architectures are typically the fastest
structures that are used to implement low resolution, high-speed ADCs.
Speed-power-accuracy Trade-off in High Speed ADC’s 127

Figure 5.4 presents a typical block diagram of an n-bit flash converter.
The resistive ladder subdivides the converter's reference voltage range
(+Vref to −Vref) into a set of on-chip reference voltages, which are compared
in parallel with the analog input signal. A logic encoder converts the
thermometer code generated by all the comparators into a binary code that
approximates the input signal every clock cycle. Note that the major
advantages of flash architectures (simplicity and parallelism) also cause
their main problem: the number of comparators increases exponentially with
the resolution, typically leading to a large die area and a high power
consumption. Therefore, this architecture is only used for n < 8. For higher
resolutions, analog pre-processing steps, like folding/interpolating or
pipelining, are used to break this exponential relationship.
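As a quick sanity check on this growth (a small sketch; the 2^n − 1 count follows from one comparator per quantization threshold):

```python
def flash_comparators(n_bits: int) -> int:
    """Number of comparators in a full-flash ADC: one per code transition."""
    return 2 ** n_bits - 1

# Every two extra bits quadruple the comparator count (and, roughly,
# the area and power of the comparator bank).
for n in (4, 6, 8, 10):
    print(n, flash_comparators(n))
```

For 6 bits this gives 63 comparators; already at 10 bits more than a thousand would be needed, which is why folding/interpolating and pipelining exist.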
As can be seen from the block diagram of the flash-converter, the correct
operation of a flash converter depends on the accurate definition of the
reference voltages sensed by each comparator. Since the comparator offset
voltage is a random variable (which depends on the matching properties of

the technology used [2]), it directly influences the DNL and INL
characteristics of the A/D converter. Therefore, the first step in the design of
a flash converter consists in deriving an offset voltage standard deviation
that guarantees with a high probability that the design complies with a
certain performance specification (high yield). The yield of the analog part
(e.g., ADC) of a mixed-mode chip must be much higher than the overall
yield, because of the relatively small area contribution of the analog part (see
Figure 5.5).

Consider that the offset voltages of all the comparators are independent
variables that follow a normal distribution. Monte Carlo simulations have
been used to estimate the design yield as a function of the offset voltage
standard deviation. (Closed-form expressions also exist for calculating the
yield as a function of the offset standard deviation [3].) The results obtained for a
6-bit converter, considering that one wants to comply with a DNL and an
INL specification of 0.5 LSB and 1.0 LSB, respectively, are presented in
Figure 5.6. So in order to design a 6-bit converter with an acceptable yield,
the comparator offset standard deviation should not exceed the value of 0.15
times the least significant bit. The next section will show models to calculate
the offset standard deviation of the comparator. Together with this
information, the trade-off between speed, power, and accuracy will be shown
in section 3.

Figure 5.7 shows an overview of some very high-speed ADCs
(references are given at the end of the chapter). The resolution of all these
converters is 6 bits, and the technology used is indicated in the figure.
Sampling speeds are increasing because of technology scaling, without
consuming more power. This is one of the implications of the trade-off
between speed-power-accuracy that will be discussed later on.

5.2.2 Models for Matching in Deep-submicron Technologies

5.2.2.1 What is Transistor Mismatch?

Two identically designed devices on an integrated circuit have random


differences in their behaviour and show a certain level of random mismatch
in the parameters that model their behaviour. This mismatch is due to the
stochastic nature of physical processes that are used to fabricate the device.
In [2] the following definition for mismatch is given: mismatch is the
process that causes time-independent random variations in physical
quantities of identically designed devices.

5.2.2.2 Transistor Mismatch Modelling

The mismatch of two identical CMOS transistors is characterized by the
random variation of the difference in their threshold voltages VT, their body
factor, and their current factor β, as a function of their size and their mutual
distance. For all but the smallest device sizes, a widely accepted and
experimentally verified model for these random variations is a normal
distribution with mean equal to zero and a variance dependent on the
gate-width W, the gate-length L, and the mutual distance D between the
devices [2]:

σ²(ΔVT) = A_VT²/(W·L) + S_VT²·D²   (5.1)
σ²(Δβ/β) = A_β²/(W·L) + S_β²·D²   (5.2)

A_VT, A_β, S_VT, and S_β are process-dependent parameters. In Table 5.1 the
proportionality constants for several processes are summarized.
Experimental data show that the correlation between the VT and β mismatch
is very low, although both parameters depend on the oxide thickness [4].
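For a feel of the numbers, the area term of the mismatch model above can be evaluated directly (the values used here, A_VT = 10 mV·μm and a 10 μm × 0.5 μm device, are hypothetical illustrations, not entries from Table 5.1):

```python
import math

def sigma_delta_vt(a_vt, w, l):
    """Standard deviation of the threshold-voltage difference of two closely
    spaced identical transistors, keeping only the area term of the model:
    sigma(dVT) = A_VT / sqrt(W*L).  Units: A_VT in mV.um, W and L in um."""
    return a_vt / math.sqrt(w * l)

print(sigma_delta_vt(10.0, 10.0, 0.5))   # about 4.47 mV
```

Quadrupling the gate area only halves the offset spread, which is the root of the speed-power-accuracy trade-off discussed below.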

The corner distance is defined as the distance D for which the
mismatch due to the distance effect is equal to the mismatch due to the size
dependence for a minimal-size device. The obtained corner distances are
very large compared to the typical size of an analog circuit. Therefore, the
distance dependence of the parameter mismatch will be neglected in the
following sections.

Table 5.1 clearly shows the dependence of A_VT on the technology,
whereas A_β remains almost constant. It can be shown that the threshold
mismatch parameter A_VT is proportional to the oxide thickness of the process.
As shown in [4], the threshold mismatch is primarily determined by the
bulk-charge fluctuations. To convert these charge fluctuations into a
voltage fluctuation at the gate of the transistor, the gate capacitance plays an
important role:

σ(ΔVT) = σ(ΔQ_B)/C_ox

Together with the expression for the standard deviation of the bulk
charges, this leads to an expression of the form

A_VT ∝ (t_ox/ε_ox) · q · √(D_tot)

where D_tot is the total implant dose.


The oxide thickness scales with the minimum technology length to
reduce the short channel effects. As a consequence, the use of a deep-
submicron technology improves the threshold matching.
Figure 5.8 shows the decrease of the threshold mismatch parameter A_VT
as a function of the technology (or the minimum transistor length). The same
is done for the current mismatch parameter A_β in Figure 5.8. As
long as the threshold mismatch dominates over the β mismatch, technology
scaling improves the overall mismatch. This will be discussed in detail in
section 3.
In high-speed analog designs, the designer prefers to use small gate
lengths so that the highest intrinsic speed for the transistor is obtained;
accurate models for minimum sized transistors are thus necessary. For the
accurate modeling of the threshold mismatch in submicron technologies the
simple linear model has to be extended for short and narrow channel effects.
The threshold voltage is dependent on the flat-band voltage, the surface
potential, the depletion charge, and the gate capacitance. It has been
experimentally verified that the mismatch of the threshold voltage is mainly

determined by the mismatch of the bulk depletion charges in the two


devices. In submicron technologies two effects introduce errors in the model.
Due to the presence of the source and the drain diffusion areas and the
charge sharing effect, part of the channel depletion charge is not controlled
by the gate voltage anymore. For devices with a small gate length, this
charge is a relatively large part of the depletion charge. As a consequence,
the threshold voltage lowers for small gate lengths whereas the variance of
the mismatch increases. A similar explanation can be given for small gate
widths. These effects can be taken into account if the following equations are
used [4]:

As shown in the first section, a flash ADC consists of an array of
preamplifiers (differential-pair structures) followed by comparators.

To end this section and introduce the next section, which deals with the
trade-off between speed, accuracy, and power, mismatch equations for a
differential pair configuration, shown in Figure 5.9, are deduced.

The next equation shows the current of a MOS transistor as a function of the
current factor β and the threshold voltage VT:

I_DS = (β/2)·(V_GS − V_T)²

The input-referred offset of a differential pair can be derived from this
equation and is given by:

σ²(V_os) = σ²(ΔV_T) + ((V_GS − V_T)/2)² · σ²(Δβ/β)

After substituting the equations for the mismatch, (5.1) and (5.2), the
offset voltage can be written in terms of the mismatch parameters A_VT and
A_β of the technology used:

σ²(V_os) = A_VT²/(W·L) + ((V_GS − V_T)/2)² · A_β²/(W·L)   (5.7)

From (5.7) it can be concluded that the current and threshold matching
depend on the gate area W·L, and that the relative importance of threshold
mismatch and current mismatch depends on the gate overdrive voltage. A
corner gate overdrive voltage (V_GS − V_T)_corner is defined for which the
effects of the V_T and β mismatch on the gate voltage or drain current are of
equal size (see Table 5.1 for values):

(V_GS − V_T)_corner = 2·A_VT/A_β

For circuits biased with a gate overdrive smaller than (V_GS − V_T)_corner,
the effect of the V_T mismatch is dominant, whereas for a gate overdrive
larger than (V_GS − V_T)_corner, the effect of the β mismatch dominates. In
practical circuits the gate overdrive will be smaller than the corner gate
overdrive voltage, so that the V_T mismatch is dominant. In practice, the
offset voltage can therefore be approximated by:

σ(V_os) ≈ σ(ΔV_T) = A_VT/√(W·L)

The approximation error, of the order of ((V_GS − V_T)/(V_GS − V_T)_corner)²/2, is
small for typical bias conditions.
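The quality of this approximation is easy to check numerically (the parameter values below are hypothetical illustrations: A_VT = 10 mV·μm, A_β = 2 %·μm, a 10 μm × 0.5 μm pair, and Vov = 0.2 V, giving a corner overdrive of 2·A_VT/A_β = 1 V):

```python
import math

def sigma_offset_mv(a_vt, a_beta_pct, w, l, vov):
    """Input-referred offset s.d. (mV) of a differential pair:
    sigma^2(Vos) = sigma^2(dVT) + (Vov/2)^2 * sigma^2(dbeta/beta).
    a_vt in mV.um, a_beta_pct in %.um, w and l in um, vov in volts."""
    s_vt = a_vt / math.sqrt(w * l)                      # mV
    s_beta = (a_beta_pct / 100.0) / math.sqrt(w * l)    # relative
    return math.sqrt(s_vt ** 2 + (1000.0 * vov / 2.0 * s_beta) ** 2)

full = sigma_offset_mv(10.0, 2.0, 10.0, 0.5, 0.2)
vt_only = 10.0 / math.sqrt(10.0 * 0.5)      # the V_T-only approximation
print(full, vt_only)   # ~4.56 mV vs ~4.47 mV: about a 2 % difference
```

With Vov at one fifth of the corner overdrive, the predicted error (0.2²/2 = 2 %) matches the numerical result.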

5.2.2.3 Speed-power-accuracy Trade-off

The trend in analog circuit design has always been towards higher speed,
higher accuracy, and lower power drain. However, it will be shown that the
speed-accuracy-power trade-off is limited by technology parameters only,
more specifically by the mismatch parameters of the technology and not by
the noise level (see Figure 5.2). This has already been demonstrated by
Kinget and Steyaert [6].
The only way to overcome this problem is by using offset compensation
or auto-zero techniques (analog or digital, background or foreground).
However, those compensation techniques require calibration phases during
which the normal system operation is interrupted, and the offset voltages of
the building blocks are sampled and dynamically stored in a memory. This
reduces the maximum processing speed and requires a lot of extra chip
overhead to provide calibration and replica circuits. In many high-speed,
low-power circuits the interruption of the system cannot be tolerated, or the
required continuous operation time is too long for the stored offset correction
to remain valid.
Therefore, the accuracy completely depends on the matching performances
of the technology. The bit accuracy that can be achieved is proportional to
the matching of the transistor. To improve the system accuracy, larger
devices are required, but, at the same time, the capacitive loading of the
circuit nodes increase and more power is required to attain a certain speed
Speed-power-accuracy Trade-off in High Speed ADC’s 135

performance. This can very easily be derived for a typical differential input
stage of a high speed ADC. The speed performance of such a topology is
approximated by:

On the other hand, the accuracy that can be achieved in a system is
proportional to the matching accuracy of the components. Equation (5.7)
gives the offset of a differential amplifier, so the accuracy
(maximum input divided by the maximum offset) is given by¹:

Accuracy = V_in,max/(3·σ(V_os)) = V_in,max·√(W·L)/(3·A_VT)   (5.11)

The power drain in that circuit is a function of the power supply voltage
V_dd and the current consumption I:

P = V_dd · I   (5.12)

If equations (5.10), (5.11), and (5.12) are combined into the
Speed·Accuracy²/Power product, the result (for a fixed supply voltage) is

Speed · Accuracy² / Power ∝ 1/(A_VT² · C_ox)   (5.13)

(The overdrive voltage is selected to maximize the trade-off and to
ensure strong inversion of the transistors, e.g., 0.2 V.)
On the other hand, the fundamental limit in the speed-power-accuracy
trade-off is imposed by thermal noise [7]:

Speed · Accuracy² / Power ∝ 1/(k·T)

where k is the Boltzmann constant and T is the absolute temperature. This
fundamental limit is independent of technology. For modern technologies
¹ By using the 3-sigma value of the offset, the accuracy specification is met with a probability
of about 99.7%.

this fundamental limit is orders of magnitude lower than the technological
limit derived before. In other words, for present-day CMOS technologies,
the performance of precision analog circuits is limited by transistor
mismatch and not by noise (see also Figure 5.2).
It has been shown that the relationship given by equation (5.13) still
holds for more complex circuits [6], such as current processing circuits
(current mirrors), voltage processing circuits (differential pairs and op-amps)
and even multi-stage circuit designs. The impact of the relationship above is
that for the circuits of today, which aim at high speed, high accuracy, and
low power drain, a technological limit is encountered, namely the mismatch
of the devices. This means that, for a given technology, if high speed and
high accuracy are required, this can only be achieved by consuming power.
For example, if one extra bit of accuracy is required in the design of AD
and DA converters, the power drain for the same speed performance will
increase by a factor of 4!
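The factor of 4 is just the square law of equation (5.13) at work; a one-line sketch:

```python
def power_ratio(extra_bits: int) -> float:
    """Power ~ Accuracy^2 at fixed speed: each extra bit doubles the
    required accuracy (halves the tolerable offset), so power quadruples."""
    return 4.0 ** extra_bits

print(power_ratio(1), power_ratio(2), power_ratio(3))  # 4.0 16.0 64.0
```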
This trade-off has also been shown in a fitting model for high-speed ADCs
in [8], where an empirical power equation is derived in which ENOB acts as
an accuracy measure, the sampling frequency as a speed factor, and the
remaining coefficient as a technological constant.
The derived performance limit caused by mismatch is, of course, only
valid for converter architectures for which accuracy relies on component
matching (unlike architectures in which accuracy is limited by noise).

5.3 IMPACT OF VOLTAGE SCALING ON TRADE-OFF IN HIGH-SPEED ADC'S

In the previous section, a fundamental trade-off between speed, accuracy,
and power was deduced:

Speed · Accuracy² / Power ∝ 1/(A_VT² · C_ox)
What happens with this trade-off when technology scales down? Several
scaling trends have been proposed, but the “constant field scaling” is
probably the most used in the microelectronics community. Several scaling
Impact of Voltage Scaling on Trade-off in High-speed ADC’s 137

trends will now be discussed and used to explore the impact of these scaling
issues on the previously deduced speed-power-accuracy trade-off.
To reduce the short-channel effects in deep-submicron transistors, the
oxide thickness is scaled down together with the minimum transistor length.
As shown in the second section, the threshold mismatch parameter A_VT is
proportional to the oxide thickness. As a consequence, A_VT decreases as
technology scales down. The gate-oxide capacitance, on the other hand,
increases when technology scales down (inversely proportional to the oxide
thickness). So, 1/(A_VT²·C_ox) increases as technology scales down, and as
a result the trade-off becomes better. This means that for the same speed
and accuracy, less power is needed when technology is scaled down.
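Under these two assumptions (A_VT ∝ t_ox and C_ox ∝ 1/t_ox, with all proportionality constants arbitrary), the improvement can be checked numerically:

```python
def tradeoff_figure(t_ox: float) -> float:
    """Relative Speed*Accuracy^2/Power figure under the stated assumptions:
    A_VT proportional to t_ox and C_ox proportional to 1/t_ox, so the figure
    scales as 1/(A_VT^2 * C_ox) ~ 1/t_ox (proportionality constants dropped)."""
    a_vt = t_ox
    c_ox = 1.0 / t_ox
    return 1.0 / (a_vt ** 2 * c_ox)

# Halving the oxide thickness doubles the achievable figure of merit.
print(tradeoff_figure(4.0) / tradeoff_figure(8.0))   # 2.0
```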

However, the maximal supply voltage also reduces for smaller oxide
thicknesses (see Figure 5.10), so that smaller signal levels have to be used
(a 0.25 μm technology uses a 2.5 V supply, a 0.18 μm technology uses a
1.8 V supply). When the supply voltage becomes smaller, the input swing
of the differential pair decreases, leading to smaller values for the least
significant bit. As a result, the maximum allowable offset also decreases.
Consequently, the scaling advantage for the trade-off with smaller
technology line-widths is reduced. Moreover, the increasing substrate
doping levels in deeper submicron technologies make the parasitic
drain-to-bulk and source-to-bulk capacitances relatively more and more
important compared to the gate-oxide capacitance. This effect is clearly
seen in Figure 5.11, where the gate-oxide and drain-bulk capacitances are
plotted as a function of minimum technology length.

In fact, the drain-bulk capacitance follows a staircase function of
technology (now and then a technology generation improves the
drain-bulk capacitance [9]). This results in extra capacitive loading of the
signal nodes and requires extra power to attain high-speed operation.
So, although the intrinsic matching quality of the technology improves
for submicron and deep-submicron technologies, practical limitations make
the theoretical boundary harder to achieve.
At the start of the mismatch analysis, the relative importance of threshold-voltage
and current-factor mismatch was compared. For present-day
processes the impact of the V_T mismatch is clearly dominant. When the
scaling trends for A_VT and A_β are compared, it is evident that the β mismatch
gains in importance for deeper submicron technologies. This trend is
confirmed by the decreasing values of the corner gate-overdrive voltage in
Table 5.1 for different technologies. For some future technology the β
mismatch will be at least as important as, and eventually more important than,
the V_T mismatch for the accuracy of circuits in the whole strong-inversion
region. At that point the minimal power consumption for a given
speed and accuracy is set by the β mismatch parameter A_β, which hardly
improves with scaling; this indicates that a further scaling of the technology
would not further improve the performance.
An example will further illustrate the scaling issues that degrade the
speed-power-accuracy trade-off in high speed ADCs.
Consider a 6-bit, 500-MSample/s CMOS ADC implemented in two
technology generations, an older one and a scaled one. First, the supply
voltages are supposed to be equal; second, the mismatch is expected to be
dominated by the threshold mismatch; and, third, the drain-bulk capacitance
is neglected.

Because the two ADCs have the same resolution, the following equation
can be proven²:

To achieve the same acquisition speed, the regenerative time constant
should be the same, leading to the next equation:

Assuming equal gate-overdrive voltages, the power drain can be
compared (A_VT is proportional to the oxide thickness):

So, to achieve the same speed and accuracy, the power in the downscaled
technology is smaller, because of the improved matching of this technology.
Now, some modifications will be performed on these equations to
include the supply-voltage scaling and the relatively increasing importance
of the drain-bulk capacitance compared to the gate-oxide capacitance.
Normally the input range of the ADC is made as high as possible. One
assumption made then is that the least significant bit of the converter scales
down together with the supply voltage, leading to a smaller allowable
mismatch:

² Index 1 is used for the first (older) technology and index 2 for the scaled technology.

The speed equation can be rewritten, now including the drain bulk
capacitance.

Again, assuming equal gate-overdrive voltages, the power drain can be
compared³:

Because ground rules do not scale at the same rate as the technology's
minimal length, the last factor is smaller than 1. This equation shows the
relatively increasing power consumption when down-scaling the
technology (m > 1).
This trend towards relatively increasing power consumption is also shown
in Figure 5.12 for three different cases:

a) Case 1: Supply voltage scaling and drain-bulk capacitance scaling.
Because of the increasing matching demands, the downscaling of the
supply voltage is no longer compensated by the improved matching, and
a flat straight line is the result.

³ The typical assumption t_ox = L/50 has been used in this equation.

b) Case 2: Same as case 1, but without drain-bulk capacitance scaling. The
extra load on the driver transistors leads to a slightly increasing straight
line.
c) Case 3: No supply voltage scaling but with drain-bulk capacitance
scaling. The improving matching properties lead to a decreasing power
consumption of the implemented converter.
To conclude, the expected power decrease is counteracted by the more
stringent mismatch demand and the relatively increasing drain-bulk
capacitance.
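The three cases of Figure 5.12 can be reproduced with a toy scaling model (our own bookkeeping under the chapter's assumptions: offset spec proportional to Vdd, A_VT ∝ t_ox, C_ox ∝ 1/t_ox; all constants and node values are arbitrary illustrations):

```python
def adc_power(t_ox, vdd, vov=0.2, cdb_frac=0.0):
    """Relative power of a flash-ADC input stage at fixed speed and
    resolution (a sketch of the scaling bookkeeping, not a power model):
      area  W*L ~ (A_VT/Vdd)^2 to meet the offset spec (LSB ~ Vdd),
      load  C   ~ W*L*C_ox*(1 + cdb_frac),
      bias  I   ~ gm*Vov with gm ~ C/tau fixed by the speed target,
    so P = Vdd*I ~ Vov*A_VT^2*C_ox*(1+cdb_frac)/Vdd, and with A_VT ~ t_ox,
    C_ox ~ 1/t_ox this reduces to Vov*t_ox*(1+cdb_frac)/Vdd."""
    return vov * t_ox * (1.0 + cdb_frac) / vdd

p_old = adc_power(t_ox=8.0, vdd=2.5)
print(adc_power(4.0, 2.5) / p_old)                  # case 3: 0.5 (power drops)
print(adc_power(4.0, 1.25) / p_old)                 # case 1: 1.0 (flat line)
print(adc_power(4.0, 1.25, cdb_frac=0.3) / p_old)   # case 2: 1.3 (power rises)
```

The model makes the mechanism explicit: the matching improvement (t_ox term) and the supply scaling (Vdd term) exactly cancel, and any relative growth of the drain-bulk load tips the balance towards more power.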
When technology scales further, the β mismatch becomes dominant,
leading to the following equation:

which makes the case even worse (power increases, as shown in Figure
5.13). This is because the scaling of the supply voltage is no longer
compensated by an improvement in the matching properties of the technology.
In this analysis, nothing has been said about the susceptibility of the high-speed
A/D converter to substrate noise, power supply noise, and ground noise. These
noise sources become relatively more important if the supply voltage scales
down.

5.3.1 Slew Rate Dominated Circuits vs. Settling Time Dominated Circuits

In the previous derivations and equations, the chosen speed parameter was
the generic small-signal time constant. Other analog building blocks can also
exhibit slew-rate behavior (together with small-signal behavior). In this
section the influence of adding slew rate is examined.
Figure 5.14 shows the settling behavior of an operational amplifier
(op-amp) in unity feedback. When a step function is applied to the input of
the op-amp, one can clearly distinguish the two speed parameters: first there
is a slewing behavior, and then a linear settling behavior is observed. To
calculate the impact of this slewing time, a simple additive model is
proposed in which the speed is determined by the linear addition of the two
speed parameters.
By using the following equations:

the influence of settling behavior on the speed-power-accuracy trade-off is
examined.
A parameter can now be defined by taking the ratio of the two speed
parameters (the slewing time divided by the small-signal settling time).
This parameter decreases when the supply voltage decreases (for a
fixed gate-overdrive voltage).
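The additive model can be sketched numerically (our own toy parameterization: the hand-off point between slewing and linear settling is a simplifying assumption, and all numbers are illustrative only):

```python
import math

def conversion_time(tau, slew_rate, v_step, n_bits=6):
    """Additive speed model from the text: total time = slewing time plus
    linear settling time to 1/2 LSB.  The hand-off between the two regimes
    is taken (as an assumption) at the swing slew_rate*tau, the point where
    the linear response would be exactly slew-rate fast."""
    v_lin = slew_rate * tau                     # swing handled linearly
    t_slew = max(0.0, v_step - v_lin) / slew_rate
    half_lsb = 0.5 * v_step / 2 ** n_bits
    t_settle = tau * math.log(v_lin / half_lsb)
    return t_slew + t_settle

# With tau = 1 and slew rate = 1 (unit-free), halving the step, as supply
# scaling does, removes the slewing portion entirely while the linear
# settling part barely changes.
print(conversion_time(tau=1.0, slew_rate=1.0, v_step=2.0))   # ~5.16 (in tau)
print(conversion_time(tau=1.0, slew_rate=1.0, v_step=1.0))   # ~4.85 (in tau)
```

This is the mechanism behind the sub-linear power trend: the part of the conversion time that benefits from a smaller swing is exactly the slewing part.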
The same equations as in the previous section can now be redone, but
now including both settling and slewing behavior:

or, with the introduction of the ratio parameter:

Thus, the power-consumption ratio (for equal gate-overdrive voltages) is:

The power consumption trend does not stay the same as in the case
where only the settling parameter was included, but has a sub-linear slope.
This is due to the introduction of the slewing behavior.
This trend is plotted in Figure 5.15 as a function of the technology
for three different gate-overdrive voltages (slew rate, settling behavior,
and the scaling effects discussed above are included). One can clearly see
that for smaller gate-overdrive voltages the power-increase turning point is
pushed towards smaller technologies. This is intuitively understood: the
supply voltage scaling is then advantageous for the power consumption,
because the circuit spends a longer time slewing.
Lowering the gate-overdrive voltage brings a remarkable conclusion: it
indicates that for future high-speed ADCs a behavior close to the slew-rate
limited behavior is preferable with respect to implementation and power
consumption.
Solutions for Low Voltage ADC Design 145

5.4 SOLUTIONS FOR LOW VOLTAGE ADC DESIGN

In the previous sections, the fundamental trade-off between speed, power,
and accuracy has been discussed. It has been shown that, without other
precautions, technology scaling will increase the power consumption of
high-speed A/D converters in the future. To circumvent this power increase,
modifications have to be found.
From a general viewpoint, modifications can be done on three levels: the
system level, the architectural level, and the technology level.

5.4.1 Technological Modifications

Not only analog circuits have problems with the decreasing power supply
voltage and mismatch; digital circuits also suffer from the mismatch between
identical devices, e.g., offsets in an SRAM cell. Because of the enormous
economic impact of digital circuits, more effort may be spent on extensive
research to achieve much better mismatch parameters in future
technologies. Here, for once, digital demands go hand in hand with analog
demands. Another technological adaptation is the use of dual-oxide
processes, which can handle the higher supply voltages necessary to achieve
the required dynamic range in data converters.

5.4.2 System Level

Optimized system-level design can substantially decrease the required
performance of the data converter in the system. High-level design decisions
can have a huge impact on the speed-power-accuracy of the ADC. This
high-level design needs behavioral models, including power estimators [8].

5.4.3 Architectural Level

In this section some possible architecture modifications are presented to
break through this trade-off. Two possibilities will be discussed: analog
preprocessing techniques and averaging techniques.
Analog pre-processing techniques reduce the input capacitance of the
flash A/D converter and the number of preamplifiers. Examples are
interpolating (voltage or current) and folding. These techniques do not really
improve the speed-power-accuracy trade-off; they only decrease the input
capacitance (which limits the highest input frequency) and the number of
preamplifiers or comparators (and so decrease the power consumption of
these converters).
Averaging is a technique that reduces the offset specification for high-
speed A/D converters without requiring larger transistor areas. Averaging
was first presented in 1991 in [10], where the outputs of the differential
bipolar preamplifiers were combined by a resistive network (shown in
Figure 5.16).

This technique makes a trade-off between the improvement in DNL/INL
and the gain of the preamplifier. The latest published high-speed
analog-to-digital converters all use this technique to reduce the power
consumption of the implemented flash ADC [11][12][13]. In [11] it is
proven that the averaging network also improves the speed of the
preamplifier. An improved version of this technique is presented in [14],
where the improvement in
Comparison with Published ADC’s 147

DNL/INL only depends on the number of stages that contribute to the
averaging.
Averaging can be seen as taking the average value of neighboring node-
voltages and thereby reducing the offset demand. The offset of the averaged
value is equal to the original offset divided by the square root of the number
of preamplifiers that are being averaged.
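The square-root law is easy to verify empirically (a simplified sketch: it treats averaging as a plain mean of independent offsets and ignores the gain and termination details of a real averaging network):

```python
import math
import random

def averaged_offset_sd(sigma, n_avg, trials=20000, seed=3):
    """Empirical standard deviation of the average of n_avg i.i.d. offsets
    drawn from N(0, sigma): averaging neighbouring preamplifier outputs
    divides the effective offset by sqrt(n_avg)."""
    rng = random.Random(seed)
    means = [sum(rng.gauss(0.0, sigma) for _ in range(n_avg)) / n_avg
             for _ in range(trials)]
    mu = sum(means) / trials
    return math.sqrt(sum((x - mu) ** 2 for x in means) / (trials - 1))

print(averaged_offset_sd(1.0, 4))    # close to 1/sqrt(4) = 0.5
print(averaged_offset_sd(1.0, 9))    # close to 1/sqrt(9) = 0.33
```

Averaging four neighbours therefore buys the same offset reduction as quadrupling the device area, but without the extra capacitive load.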
A modification to this technique, called shifted averaging, was first
presented in [15]. This technique eliminates the need for averaging resistors
connecting neighboring stages, but the overall reduction in DNL/INL is
fixed.
The same principle as the one employed in shifted averaging has been
used in [16] where “reinterpolation” is performed to reduce the input-
referred offset.
In pipelined structures, error correction (digital or analog) is performed to
reduce the offset demands of the comparators.

5.5 COMPARISON WITH PUBLISHED ADC’S

To compare the developed equations and the published data, a figure (see
Figure 5.17) is presented that shows the figure of merit of several published
6-bit converters vs. their implementation technology.

The figure of merit used here is:

One clearly sees a good agreement between the equations and the
published data. The averaging technique also appears to be a good candidate
to circumvent the speed-power-accuracy trade-off.

5.6 SUMMARY

After a general introduction on low-voltage and low-power consumption in
analog signal processing systems and the derivation of the minimum power
consumption equation, the focus was placed on the implications of lower
supply voltages in CMOS technologies on the power consumption of high-speed
analog-to-digital converters. An overview of the state-of-the-art high-speed
A/D converters and their architecture was given. Mismatch models for
speed A/D converters and their architecture was given. Mismatch models for
deep submicron technologies were discussed, followed by an analysis of the
speed, power, and accuracy trade-off in these A/D converters. This speed,
power, and accuracy trade-off is only dependent on the mismatch
specifications of the technology used for the design of the ADC. An in-depth
analysis on the influence of technology scaling (together with supply voltage
scaling) on this trade-off was made. It was shown that without extra
modifications to the design or technology, power consumption will become a
problem for future high-speed A/D converters. Some solutions to circumvent
this trade-off (and thus lower the power consumption) were discussed and
averaging was seen as the only way out of the fundamental trade-off.
The better the mismatch of devices is modeled and characterized, the
smaller the area the designer can safely use while keeping a high circuit
yield; consequently, the circuits will consume less power for the specified
accuracy and speed. Technology scales so fast that mismatch parameter
extraction and mismatch model generation must be completed in much less
time. Mismatch data extrapolated from previous processes can differ
substantially from the actual data, resulting in non-optimal data converter
designs.
Summary 149

REFERENCES

[1] E.A. Vittoz, “Future of Analog in the VLSI Environment,” ISCAS 1990, pp. 1372-1375,
May 1990.
[2] M. Pelgrom et al., “Matching Properties of MOS Transistors,” IEEE Journal of Solid-
State Circuits, vol. 24, no. 5, pp. 1433-1439, Oct. 1989.
[3] M.J.M. Pelgrom, A.C.J. v. Rens, M. Vertregt and M. Dijkstra, “A 25-Ms/s 8-bit CMOS
A/D Converter for Embedded Application,” IEEE JSSC, vol. 29, no. 8, Aug. 1994.
[4] J. Bastos et al., “Mismatch characterization of small size MOS Transistors,” Proc. IEEE
Int. Conf. On Microelectronic Test Structures, vol. 8, pp. 271-276, 1995.
[5] W.M.C. Sansen and K.R. Laker, “Design of analog integrated circuits and systems,”
McGraw-Hill International Editions, 1994.
[6] P. Kinget and M. Steyaert, “Impact of transistor mismatch on the speed-accuracy-power
trade-off of analog CMOS circuits,” Proceedings CICC, May 1996.
[7] Iuri Mehr and Declan Dalton, “A 500 Msample/s 6-Bit Nyquist Rate ADC for Disk
Drive Read Channel Applications” , Journal of Solid State Circuits , Sept. ’99.
[8] E. Lauwers and G. Gielen, “A power estimation model for high-speed CMOS A/D
Converters,” Proc. DATE, March 1999.
[9] Q. Huang et al., “The Impact of Scaling Down to Deep Submicron on CMOS RF
Circuits,” IEEE JSSC, Vol. 33, no. 7, July 1998.
[10] K. Kattmann and J. Barrow, “A Technique for reducing differential non-linearity errors
in flash A/D converters,” 1991 IEEE ISSCC Dig. Of Tech. Papers, pp. 170-171, Feb.
1991.
[11] Abidi et al., “A 6-bit, 1-3 GHz CMOS ADC,” IEEE ISSCC, San Francisco, Feb. 2001.
[12] P. Scholtens et al., “A 6-bit, 1-6 GHz CMOS Flash ADC,” to be presented at ISSCC, San
Francisco, Feb. 2002.
[13] G. Geelen, “A 6b, 1.1 Gsample/s CMOS A/D Converter,” IEEE ISSCC, San Francisco,
Feb. 2001.
[14] K. Bult and A. Buchwald, “An embedded 240mW 10b 50Ms/s CMOS ADC in
IEEE JSSC, Vol. 32, pp. 1887-1895, Dec. 1997.
[15] G. Hoogzaad and R. Roovers, “A 65-mW, 10-bit, 40-Ms/s BICMOS Nyquist ADC in 0.8
IEEE JSSC, Dec. 1999.
[16] Yun-Ti Wang and B. Razavi, “An 8-bit, 150-MHz CMOS A/D Converter,” Proceedings
Custom Integrated Circuits Conference, pp. 117-120, May 1999.
[17] M. Flynn and B. Sheahan, “A 400 Msample/s 6b CMOS Folding and Interpolating
ADC,” ISSCC ’98.
[18] Sanruko Tsukamoto et al., “A CMOS 6b 400 Msample/s ADC with Error Correction,”
ISSCC ’98.
[19] K. Nagaraj et al., “A 700 Msample/s 6b Read Channel A/D converter with 7b Servo
Mode,” ISSCC ’00, Feb. 2000.
[20] K. Sushihara, “ A 6b 800 Msample/s CMOS A/D Converter,” ISSCC ’00, Feb. 2000
[21] Declan Dalton et al., “A 200-MSPS 6-Bit Flash ADC in CMOS,” Journal of
Solid State Circuits, Nov. 1998.
[22] R. Roovers and M. Steyaert, “A 6bit, 160mW, 175 MS/s A/D Converter,” Journal of
Solid-State Circuits, July ’96.
[23] Yuko Tamba, Kazuo Yamakido, “A CMOS 6b 500Msample/s ADC for a Hard Disk
Read Channel,” ISSCC ’99.
Chapter 6
Low Power Flip-Flop and Clock Network Design
Methodologies in High-Performance System-on-a-
Chip

Chulwoo Kim1 and Sung-Mo (Steve) Kang2


1 IBM, Microelectronics Division, Austin, TX; 2 University of California, Santa Cruz, CA

Abstract: In many VLSI (very large scale integration) chips, the power dissipation of the
clocking system that includes clock distribution network and flip-flops is often
the largest portion of total chip power consumption. In the near future, this
portion is likely to dominate total chip power consumption due to higher
clock frequencies and the trend toward deeper pipelines. Thus it is important
to reduce power consumption in both the clock tree and the flip-flops.
Traditionally, two approaches have been used: 1) to reduce power
consumption in the clock tree, several low-swing clock flip-flops and
double-edge flip-flops have been introduced; 2) to reduce power consumption
in flip-flops, conditional capture, clock-on-demand, and data-transition
look-ahead techniques have been developed.
In this chapter these flip-flops are described with their pros and cons. Then, a
circuit technique that integrates these two approaches is described along with
simulation results. Finally, clock gating and logic embedding techniques are
explained as powerful power saving techniques, followed by a low-power
clock buffer design.

Key words: Flip-flop, small-swing, low-power, clock tree, statistical power saving, clock
gating, double edge-triggered, logic embedding, clock buffer.

6.1 INTRODUCTION

6.1.1 Power Consumption in VLSI Chips

Very deep sub-micron (VDSM) technology will soon produce billion-transistor
chips. As shown in Table 6.1, future microprocessors may
consume hundreds of watts unless further improvement is made in low-
power design [1]. Power dissipation has become a critical concern due to
power density, limited battery life, and the reliability of integrated circuits
[2]. The need for cheap packaging will require further reduction in power
consumption. Heat sinks required for high-power chips occupy a large
amount of space, and a cooling fan causes extra power consumption. Also,
low-power system-on-a-chips (SoCs) are needed to meet the market demand
for portable equipment, such as cellular phones, laptop computers, personal
digital assistants (PDAs), and, soon, wearable computers.
In the near future, L di/dt noise is another important concern that
demands low-power consumption in high-performance microprocessors [3].
At the clock edge, large amounts of power supply current are required
instantaneously. However, inductance in the power rails limits the ability to
deliver the current fast enough, thus leading to core voltage droop. For
example, when a 1-GHz microprocessor has a 1.6 V core voltage, a 2 pH
package inductance, and a di/dt of 80 A/ns, then the first inductive voltage
droop will be about 160 mV, 10% of the core voltage. If a 10 GHz
microprocessor has a 0.6 V core voltage, a 0.5 pH package inductance, and a
di/dt of 1000 A/ns, then the first inductive voltage droop will be 500 mV,
that is 83.3% of the core voltage. To suppress L di/dt noise, various power
saving techniques are essential in future chip design.
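The droop figures above follow directly from V = L · di/dt. A minimal sketch (using the example values quoted in the text) reproduces both cases:

```python
def inductive_droop(inductance_h, didt_a_per_s):
    """First-order supply droop across the package inductance: V = L * di/dt."""
    return inductance_h * didt_a_per_s

# 1-GHz case: 2 pH package inductance, di/dt of 80 A/ns
droop_1ghz = inductive_droop(2e-12, 80 / 1e-9)
# 10-GHz case: 0.5 pH package inductance, di/dt of 1000 A/ns
droop_10ghz = inductive_droop(0.5e-12, 1000 / 1e-9)

print(f"{droop_1ghz * 1e3:.0f} mV, i.e. {droop_1ghz / 1.6:.0%} of a 1.6 V core")
print(f"{droop_10ghz * 1e3:.0f} mV, i.e. {droop_10ghz / 0.6:.1%} of a 0.6 V core")
```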

6.1.2 Power Consumption of Clocking System in VLSI Chips

In many VLSI chips, the power dissipation in the clocking system that
includes clock distribution network and flip-flops is often the largest portion
of total chip power consumption, as shown in Figure 6.1 [4][5][6][7]. This is
due to the fact that the activity ratio of the clock signal is unity and the
interconnect length of the clock trees has increased significantly. In Figure
6.1, hashed bars represent the power consumption in the clock distribution
network (clock tree and clock buffers), and a dark bar represents the power
dissipation in the clock network and storage elements (latches and flip-
flops). The design trend for using more pipeline stages for high throughput
increases the number of flip-flops in a chip. With deeper-pipeline design,
clocking system power consumption can be more than 50% of the total chip
power consumption, and the portion will increase as the clock frequency
goes up. The clock frequency of microprocessors has been doubled every
two to three years as reported in the literature. In a recent high-frequency
microprocessor, the clocking system consumed 70% of the total chip power
[7]. Thus, it is important to reduce power consumption in both the clock
trees and the flip-flops.

Power consumption of a particular clocking scheme can be represented as

P_clocking = P_clk + P_FF    (6.1)

where P_clk and P_FF represent the power consumptions in the clock network
and in the flip-flops, respectively. Each term in Equation (6.1) is dominated
by dynamic power consumption and can be expressed as

P_clk = C_inter · V_swing^2 · f_clk    (6.2)

P_FF = Σ_i P_FF,i    (6.3)

P_FF,i = C_clk · V_swing^2 · f_clk + n · (α_int · C_int + α_out · C_out) · V_DD^2 · f_clk + β · P_gen    (6.4)

where P_FF,i represents the power consumption of an individual flip-flop.
C_inter, C_clk, C_int, C_out, V_swing, α_int, α_out, and f_clk represent the
interconnect line capacitance, the capacitance of the clocked transistors of a
flip-flop, the internal node capacitance of a flip-flop, the output node
capacitance of a flip-flop, the clock swing voltage level, the internal node
transition activity ratio, the output node transition activity ratio, and the
clock frequency, respectively. Also, n is 2 for double-edge triggered
flip-flops and 1 for single-edge triggered flip-flops, because f_clk of a
double-edge triggered flip-flop is reduced to half compared to that of a
single-edge triggered flip-flop. P_gen denotes the power of a local clock
buffer or short pulse generator; its weight β is 1 if each flip-flop has a
local clock buffer or a short pulse generator inside the flip-flop, 1/k if k
flip-flops share a local clock buffer or a short pulse generator, and 0
otherwise. From the terms in Equations (6.2) and (6.4), we can see how to
reduce power consumption in the clocking system. Four basic approaches are
reducing node capacitance, lowering the voltage swing level, removing
redundant switching activities, and reducing the clock frequency.
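To see how these terms trade off, the sketch below codes up a dynamic-power model of this form. The function names, symbol spellings, and all numeric values are illustrative assumptions of ours, not the chapter's data; the point is only the structure: clock-network power scales with the swing squared and the clock frequency, while the data-dependent flip-flop terms follow throughput.

```python
def clock_network_power(c_inter, v_swing, f_clk):
    """Dynamic power of the clock distribution wiring (Eq. (6.2) form)."""
    return c_inter * v_swing**2 * f_clk

def flipflop_power(c_clk, c_int, c_out, v_swing, vdd,
                   a_int, a_out, f_clk, n=1, beta=0.0, p_gen=0.0):
    """Per-flip-flop dynamic power (Eq. (6.4) form).

    n:    2 for double-edge triggering, 1 for single-edge; it ties the
          data-dependent power to throughput rather than to f_clk alone.
    beta: 1 for a private local clock buffer / pulse generator,
          1/k when k flip-flops share one, 0 when there is none.
    """
    clk_term = c_clk * v_swing**2 * f_clk
    data_term = n * (a_int * c_int + a_out * c_out) * vdd**2 * f_clk
    return clk_term + data_term + beta * p_gen

# Illustrative numbers: 1000 flip-flops, full-swing single-edge clocking at
# 250 MHz versus half-swing double-edge clocking at 125 MHz (same throughput).
base = clock_network_power(50e-12, 1.5, 250e6) + 1000 * flipflop_power(
    10e-15, 20e-15, 30e-15, 1.5, 1.5, 0.3, 0.3, 250e6, n=1)
low = clock_network_power(50e-12, 0.75, 125e6) + 1000 * flipflop_power(
    10e-15, 20e-15, 30e-15, 0.75, 1.5, 0.3, 0.3, 125e6, n=2)
print(f"low-swing DETFF power / baseline = {low / base:.2f}")
```

With these hypothetical numbers, the half-swing double-edge scheme lands at about 30% of the baseline clocking power, which is the kind of combined saving the rest of the chapter pursues.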
The hybrid-latch flip-flop (HLFF) and semi-dynamic flip-flop (SDFF)
are known as the fastest flip-flops, but they consume large amounts of power
due to redundant transitions at internal nodes [8][9]. To reduce the redundant
power consumption in internal nodes of flip-flops, several statistical power-
saving techniques have been proposed. In particular, data transition look-
ahead flip-flop, clock-on-demand flip-flop, and conditional
capture/precharge flip-flops have been proposed [10][11][12][13]. However,
they use full-swing clock signals that cause significant power consumption
in the clock tree.
To reduce power consumption in clock distribution networks, several
small-swing clocking schemes have been proposed, and their potential for
practical applications has been shown [14][15][16]. The half-swing scheme
requires four clock signals and suffers from skew problems among the four
clock signals. It also requires additional chip area [14]. A reduced clock-
swing flip-flop (RCSFF) requires an additional high power supply voltage to
reduce the leakage current [15]. A single-clock flip-flop for half-swing
clocking does not need high power supply voltage but has a long latency
[16]. As an alternate effective way of reducing power consumption in the
clock network, double edge-triggered flip-flops (DETFFs) have been
developed [16][18][19][20]. DETFFs can lower power consumption in the
clock network by 50%. A low-swing clock double edge-triggered flip-flop
has been developed by merging two power-saving techniques for the clock
network [30]. In this chapter we will focus on flip-flops rather than latches
because many industry designers prefer flip-flops due to easier timing
verification, although latch-based design can take advantage of time
borrowing.
The rest of this chapter is organized as follows. Section 6.2 describes the
high-performance flip-flops and their shortcomings. In Section 6.3, we
describe the several kinds of approaches to reduce clocking power
consumption with various flip-flops followed by simulation results of several
flip-flops. In Section 6.4, we present clock gating, logic embedded flip-
flops, and low-power clock buffer design. Soft errors due to energetic alpha
particles are also covered in Section 6.4. Each power-saving approach is
compared in Section 6.5. Finally, conclusions are drawn in Section 6.6.

6.2 HIGH-PERFORMANCE FLIP-FLOPS

As the clock frequency doubles every two to three years and the number of
gates per cycle decreases with deeper pipeline design, flip-flop insertion
overhead increases significantly. To minimize the flip-flop insertion
overhead, high-performance flip-flop design is crucial in high-speed SoC
design. Both HLFF by H. Partovi and SDFF by F. Klass shown in Figure
6.2(a) and (b) have been known as the fastest flip-flops [8][9].
Both of them are based on short pulse triggered latch design and include an
internal short pulse generator. For example, the front-end of HLFF is a pulse
generator and its back-end is a latch that captures the pulse generated in the
front-end. Figure 6.3 illustrates a short pulse generation in HLFF. At the
rising edge of the CK signal, CKbd is in the “Hi” state and goes “Lo” after a
3-inverter delay (tp). Hence, a virtual short pulse, PC in Figure 6.3, is applied
to the front-end of HLFF. During the short time tp, 3 stacked NMOS
transistors in the front-end will conduct if D is “Hi” and 3 stacked NMOS
transistors in the back-end will conduct if D is “Lo.” The small
transparency window of HLFF is closely related to its hold time. Hence, the
minimum delay (3 inverter delay) between flip-flops should be guaranteed to
avoid the hold time violation. HLFF has several advantages: small D-Q
delay, negative setup time, and logic embedding with small penalty.
SDFF has similar characteristics to HLFF. A back-to-back inverter is added
at the internal node for robust operation. The back-end latch has only two
stacked NMOS transistors, which enables SDFF to operate faster than
HLFF. A NAND gate is used for conditional shutoff, which is robust with
respect to variations of sampling window compared to the unconditional
shutoff. SDFF has a negative setup time and a small logic-embedding
overhead, which will be explained in Section 6.4.2. Several modified SDFFs
were proposed, and their potential for practical applications has been cited in
[21], [21]. An alpha-particle hardened SDFF was used to protect sensitive
nodes of flip-flops from energetic alpha particles in the SPARC V9 64-b
microprocessor [21]. A simplified SDFF was used in the MAJC 5200
microprocessor (Sun Microsystems) for faster operation with a lower device
count [21]. The disadvantage of HLFF and SDFF is that they consume large
amounts of power due to redundant transitions at internal nodes.
Additionally, short pulse generators inside the flip-flops always toggle and
consume power.

6.3 LOW-POWER FLIP-FLOPS

In this section, several power-saving approaches to reduce power
consumption in the clock network and the flip-flops will be described. The
rest of this section is organized as follows. Section 6.3.1 describes the
low-power transmission-gate master-slave flip-flop and a modified flip-flop.
In Section 6.3.2, four statistical power reduction techniques are
explained. Sections 6.3.3 and 6.3.4 explain power saving methodologies for
clock networks such as small-swing clocking and double-edge triggering. In
Section 6.3.5, low-swing double edge-triggered flip-flops are presented,
which combine good features of both technologies described in Sections
6.3.3 and 6.3.4. Finally, simulation results are compared in Section 6.3.6.

6.3.1 Master-Slave Latch Pairs

A master-slave latch pair with a two-phase clock can form a flip-flop. The
transmission gate master-slave latch pair (TGFF) used in the PowerPC 603 is
shown in Figure 6.4 [23]. A schematic of a modified flip-flop is
shown in Figure 6.5 [24].

Both of them are reported as low-power flip-flops [25][26]. Although the
Clk-Q delay of TGFF is smaller, the larger setup time of TGFF makes the D-
Q delay of TGFF relatively large compared to those of other flip-flops [25].
A large clock load is another drawback of TGFF. One important
consideration in TGFF design is that the input wire must be kept short
enough that the input is not corrupted by noise. The modified flip-flop has a
smaller clock load than TGFF due to its use of local clock buffers, which
also makes it insensitive to clock-slope variations. However, the modified
flip-flop is slower than TGFF due to the stacked transistors at the first stage
and at the output driver.
6.3.2 Statistical Power Reduction Flip-Flops

Several statistical power reduction techniques have been proposed to reduce
the redundant power consumption in flip-flops, which is proportional to the
input data switching activity [10][11][12][13]. A data-transition look-ahead
DFF (DLDFF) was proposed by M. Nogawa and Y. Ohtomo in the mid
1990s, and its schematic
is shown in Figure 6.6 [10]. DLDFF consists of a conventional DFF without
feedback path for a master latch, a data-transition look-ahead circuit, and a
clock control circuit. Although the transistor count is more than the
conventional DFF, its area penalty is not significant because additional
transistors are very small in size [10]. A sub-nanosecond pulse generator
provides clock pulses that are shared by a group of DLDFFs. Whenever
input D and output Q are the same, P1 is “Lo” and the internal clocks CK
and CKN stay “Lo” and “Hi,” respectively. DLDFF consumes less power if
the input data switching activity is less than 0.6. However, DLDFF's pulse
generator consumes redundant power, and the generated pulse can be
distorted during propagation.
Hamada et al. proposed the clock-on-demand flip-flop (CODFF), which
does not need an external pulse generator as shown in Figure 6.7 [11].
CODFF consists of a latch and a clock gating circuit. Although the clock
gating circuit consumes additional power, the reduced transistor count
offsets the power and area penalties. CODFF consumes less power compared
to conventional DFF if the input data switching activity is less than 0.95.
However, speed degradation cannot be avoided in CODFF.
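The break-even activities quoted above (about 0.6 for DLDFF, 0.95 for CODFF) can be reproduced by equating two first-order power-versus-activity lines. The numbers below are hypothetical, chosen only to illustrate the calculation:

```python
def breakeven_activity(p_conv_static, p_conv_per_alpha,
                       p_cond_static, p_cond_per_alpha):
    """Switching activity below which the conditional-clocking flip-flop
    consumes less power than the conventional one.

    Both designs are modeled as P(alpha) = P_static + alpha * P_per_alpha,
    where P_static collects the activity-independent clock-related power.
    """
    # Solve: p_conv_static + a*p_conv_per_alpha == p_cond_static + a*p_cond_per_alpha
    return (p_conv_static - p_cond_static) / (p_cond_per_alpha - p_conv_per_alpha)

# Illustrative (not measured) numbers in microwatts: the conventional DFF
# burns 30 uW of clock power regardless of data, while the look-ahead FF
# idles at 6 uW but pays more per data transition.
alpha_star = breakeven_activity(30.0, 40.0, 6.0, 80.0)
print(f"conditional FF wins for alpha < {alpha_star:.2f}")
```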

Conditional capture flip-flop (CCFF) has been proposed to reduce the
redundant power consumption in internal nodes of high-performance flip-
flops [12]. Figure 6.8 shows a single-ended CCFF based on HLFF. A NOR
gate is added in the first stage to eliminate the unnecessary internal node
transitions of HLFF. The area overhead due to the increased transistor count
is not significant because the channel widths of added transistors are very
small. A major merit of CCFF is that there is no speed penalty. The setup
time of CCFF is increased compared to HLFF for two reasons: large
recovery time on dipped node Q [12] and an increased sampling time for
capturing input “Lo” because input data should arrive one inverter delay
earlier than the rising edge of clock signal [27]. The conditional capture
technique needs many additional transistors for certain flip-flops such as
SDFF, which tends to offset the power savings.

Another power saving technique, a conditional precharge flip-flop
(CPFF), was proposed by Nedovic et al. and is shown in Figure 6.9 [27].
Unlike CCFF, the internal node precharge is determined by the output signal
Q. CPFF has increased setup time similar to CCFF. To solve this large setup
time problem, an alternative version of CPFF (ACPFF) was introduced with
a power consumption penalty [27]. Also, an improved CCFF was proposed
to reduce the setup time of CCFF without power consumption cost [28].

6.3.3 Small-Swing Flip-Flops

Flip-flops with the statistical power saving techniques in section 6.3.2 use
full-swing clock signals that cause significant power consumption in the
clock tree. One of the most efficient ways to save power in the clock
network is to reduce the voltage swing of the distributed clock signal.
Figures 6.10, 6.11, and 6.12 show several small-swing clocking flip-flops
and their multi-phase or single-phase clock signals.

Figure 6.10(a) shows the schematic of a half-swing flip-flop (HSFF).
This flip-flop is the early work that incorporated the small-swing clocking
concept into the literature. HSFF, used with a two-phase non-overlapping
clocking scheme, can reduce clock-driver power consumption by 75%
thanks to the halved clock swing. HSFF requires four clock signals, which
causes it to suffer from skew problems among the four clock signals, along with
additional area, as shown in Figure 6.10(b). Two upper-swing clocks (CKP,
CKPb) are fed to PMOS transistors, and the other two lower-swing clocks
(CKN, CKNb) are fed to NMOS transistors. Hence this scheme needs a
special clock driver circuit that requires large capacitors. Also, this scheme
increases the interconnect capacitance of the clock networks and thus the
power consumption. The speed degradation and increased setup/hold time of
the half-swing scheme also cannot be avoided. While the speed penalty in
low-frequency applications can be ignored, the relative speed degradation in
high-frequency applications is quite significant.
RCSFF uses only one clock signal, but it requires an additional high
power-supply voltage for well-bias control in order to reduce
the leakage current, as shown in Figure 6.11(a). Although a simple clocking
scheme can be used for RCSFF as shown in Figure 6.11(b), its cross-
coupled NAND gates cause a speed bottleneck for RCSFF.

The single clock flip-flop (SCFF) can operate with a small-swing clock
without a leakage current problem because the clock (as shown in Figure
6.12(a)) drives no PMOS transistors. It can also use a simple clocking
scheme similar to Figure 6.11(b) with a lower clock swing level. But the
peak value of the clock signal in SCFF can be reduced to half [16].
While its single clock phase is advantageous, a drawback of SCFF lies in its
long latency; it samples data at the rising edge of the clock signal and
transfers sampled data at the falling edge of the clock signal. This long
latency becomes a bottleneck for high-performance operation.
6.3.4 Double-Edge Triggered Flip-Flops

Another efficient way to save power in the clock network is to reduce the
frequency of the distributed clock signal by half via double-edge triggering.
Double-edge triggered flip-flops (DETFFs) can reduce power dissipation in
the clock tree, ideally by half. This requires a 50% clock duty ratio in
order not to incur any performance degradation in the system. However, it is
not easy to achieve both a 50% duty ratio and the same amount of clock
skew on the rising and the falling edges of the clock. Therefore, when these
non-ideal penalties are considered, the clock frequency should be adjusted as
shown in Figure 6.13.

Let the frequency of a single-edge triggered flip-flop be 1/T as shown in
Figure 6.13(a) and that of DETFF be 1/2T as shown in Figure 6.13(b) and
(c). Figure 6.13(b) has exactly the 50% duty ratio, and there is no problem
due to the halved frequency. However, if the halved clock has a non-ideal
duty-cycle error, so that one half-period is shortened by some amount Δ, the
combinational logic between DETFFs may not finish the evaluation of the
whole path within that shortened half-period, as shown in Figure 6.13(c). To
solve this problem, the clock period can be stretched from 2T to 2(T + Δ), as
shown in Figure 6.13(d), and we can thereby further reduce the power
consumption, with a small performance penalty of Δ/T. At the halved clock
frequency, it would be easier to achieve the 50% duty ratio compared to the
original clock frequency. The case of a duty cycle greater than 50% can be
explained in a similar manner. To keep the same clock frequency as in the
single-edge triggering case, the power supply voltage can instead be
increased slightly to compensate for the delay penalty.
Figures 6.14, 6.15, and 6.16 show several DETFFs. Most DETFFs in the
literature consist of duplicated latches that are connected in parallel with
opposite clock phases to respond on both clock edges [16][18][19][20][29].
This causes a significant area overhead due to large transistor counts and
flip-flop power consumption. Furthermore, the speed of DETFF degrades
due to increased internal and/or output node capacitance. This makes
DETFFs unfit for high-performance applications. The amount of power
saving in the clock network using DETFF can be offset due to increased
(doubled in the worst case) clock load.

Figure 6.14 shows a DETFF proposed by Gago et al. [16]. It is composed
of two clocked cross-coupled latches (P1, P3, P4, N1, N3, N4 & P2, P5, P6,
N2, N5, N6), two clocked input buffers (P7/N7, P8/N8), and a shared output
driver. The outputs of each pair will be multiplexed by clocked transistors
(P1, P2, N1, and N2). If CLK is “Lo,” then the right side input buffer
(P8/N8) and the left side cross-coupled latch (P1, P3, P4, N1, N3, and N4)
will operate while the counterparts will be off. The value of node Z2 will be
inverted input data D. Since the right side cross-coupled latch is off, data at
node Z2 does not change node Q value until CLK goes “Hi.” In cross-
coupled latches, the sizes of the right side inverters (P4/N4, P6/N6) should
be smaller than those of the left ones (P3/N3, P5/N5) for proper operation.
Thus the conductance of the right inverter is smaller, and therefore node
Z1/Z2 can flip the data at node Q [16]. Careful transistor sizing is required to
get the optimum performance of this DETFF.
A DETFF proposed by Hossain et al. is shown in Figure 6.15 [18]. It
consists of two D-type latches in parallel with a shared output driver. To
reduce the area overhead of previous double-edge triggered flip-flops, it uses
only NMOS transistors instead of complementary transmission gates, which
causes speed degradation. In addition, the voltage drop at the input node of
the output driver causes leakage or DC power dissipation.
Another DETFF proposed by Mishra et al. is illustrated in Figure 6.16
[19]. It consists of two true-single phase clock (TSPC) type latches and a
NAND gate. The total number of clocked transistors is reduced from that of
previous DETFF based on TSPC type latches. When CLK is “Hi,” node X2
is “Lo” and X3 is “Hi” and the output Q depends on input data D. If D is
“Hi” (“Lo”), then Q is “Hi” (“Lo”). When CLK goes “Lo,” node Y1 goes
“Hi” immediately, which in turn makes node Q start to discharge. At the
same time, node X2 goes “Hi.” After one inverter delay, node X3 goes low
and output node Q starts to ramp up. This deep dip at node Q is due to
different signal path delays and cannot be avoided in this DETFF.

6.3.5 Low-Swing Clock Double-Edge Triggered Flip-Flop

Recently, a low-swing clock double-edge triggered flip-flop (LSDFF)
was proposed. It exploits a small-swing and a double-edge triggering scheme
at the same time [30]. A schematic diagram of our LSDFF is shown in
Figure 6.17. It is composed of a data-sampling front-end (P1, N1, N3-N6,
I1-I4) and a data-transferring back-end (P2, N2, I9, I10). Internal nodes X
and Y are charged and discharged according to the input data D, not by the
clock signal. Therefore, the internal nodes of LSDFF switch only when the
input changes.
LSDFF does not require a conditional capture mechanism as used in the
pulse-triggered true-single-phase-clock (TSPC) flip-flop (PTTFF) [31]. In
PTTFF, one of the data-precharged internal nodes is in a floating state, which
may cause a malfunction of the flip-flop. Additionally, its internal node does
not have a full voltage swing, thus causing performance degradation. To
remove such shortcomings, two latches (I5/N7 and I6/P3) are introduced in
LSDFF as shown in Figure 6.17. The use of one inverter and one transistor
pair (I5/N7 and I6/P3) reduces the fighting current, thus reducing the latency
and power consumption [30]. Although these latches improve performance,
careful layout is required to minimize coupling noise. A noisy environment
or clock gating operation may cause data loss in LSDFF via coupling noise
and/or leakage current through N3~N6. For such situations, back-to-back
inverters, instead of I5/N7 and I6/P3, are highly recommended for robust
operation of LSDFF, at the cost of a minor speed penalty.
Avoidance of stacked transistors at the back-end of LSDFF further reduces
the latency. Like HLFF, SDFF, and CCFF, a back-to-back-inverter type
driver at the output node is used for robust operation.
The clock load in LSDFF is an NMOS transistor (N4) and an inverter
(I1), and thus the clocked-transistor capacitance in equation (6.4) is
significantly reduced compared to previous flip-flops, as shown in Section
6.3.6. Furthermore, the reduced
clock-swing technique can be easily applied without inducing static
power dissipation or a complex clocking scheme. For LSDFF, with a simple
clocking scheme, double-edge triggering can be implemented to sample and
transfer data at both the rising edge and the falling edge of the clock. At the
rising edge of the clock signal, transistors N3 and N4 are both turned on for
a short duration to sample data, while at the falling edge of the clock signal,
N5 and N6 are turned on to sample data for the same short duration. Hence,
the clock frequency in equation (6.2) can be lowered to half, and,
accordingly, the clock network power consumption can be reduced by 50%.
Figure 6.18 shows the concept of the proposed clocking scheme, and
Figure 6.19 shows equivalent implementation methods. With type A, the
timing skew between CKd and CKdb can be minimized by tuning the
transistor sizes of the inverters. For type B, a pulsed-clock signal can be
generated from an additional pulsed-clock generator. Although the inverter
overhead is removed in LSDFF, degraded pulse amplitude and width may be
a problem for clock signal propagation. Type C is considered the best
method for removing timing skew with some additional power consumption.

The operation of LSDFF is explained next. In Figure 6.17, prior to the rising
edge of clock signal, CK, N3~N6 are off. When the input changes to “Hi,”
node Y is discharged to “Lo” through NMOS transistor N1, and node X
retains the previous data value “Hi.” After the rising edge of CK, N3 and
N4 are on, and node X is discharged to “Lo.” Node X drives the gate of P2,
which in turn charges the output node Q to “Hi.” When the input changes to
“Lo,” node X is charged to “Hi” through PMOS transistor P1, and node Y
retains the previous data value “Lo.” After the rising edge of CK, N3 and
N4 are on, and node Y is charged, first through the NMOS pass path and
finally to the full supply level by P3.
Node Y drives the gate of N2 to discharge the output node Q to “Lo.” The
operation at the falling edge of CK can be explained in a similar manner.
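Behaviorally, the sampling described above means the output is updated on every clock transition; the toy model below (a functional sketch, not a circuit-level simulation) captures that contract:

```python
class DoubleEdgeFF:
    """Behavioral model of a double-edge triggered flip-flop: the input is
    sampled on BOTH clock transitions, so a clock at half frequency
    sustains the single-edge data throughput."""

    def __init__(self, init=0):
        self.q = init
        self._clk = 0

    def tick(self, clk, d):
        if clk != self._clk:          # any edge, rising or falling
            self.q = d
            self._clk = clk
        return self.q

ff = DoubleEdgeFF()
clk_wave = [0, 1, 1, 0, 0, 1]          # three clock transitions
data =     [1, 1, 0, 0, 1, 1]
outs = [ff.tick(c, d) for c, d in zip(clk_wave, data)]
print(outs)  # → [0, 1, 1, 0, 0, 1]
```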
To prevent performance degradation of LSDFF due to the reduced clock
swing, low-Vt transistors are used for the clocked transistors
(N3-N6). Subthreshold current flow of low-Vt devices will be significant in
very deep submicron (VDSM) technology and should be controlled to
reduce the leakage power consumption. In LSDFF, the leakage current of
transistors N3-N6 will be limited by a high-Vt transistor in the off position,
either P1 or N1, according to the input data. For the propagation of reduced
clock-swing signals, inverters with low-Vt transistors (I1-I3) can be used
along with a lower power-supply voltage. The leakage currents of these
inverters are not significant at the lower supply voltage.
The simulated waveforms of LSDFF are shown in Figure 6.20. The
simulation conditions were 1.5 V Vdd and 80°C with the clock frequency at
125 MHz. The output load capacitance was assumed to be 100 fF.

6.3.6 Comparisons of Simulation Results

We have analyzed and simulated several flip-flops in a CMOS
process. As the core frequency of SoCs increases, high-performance flip-
flops will be required to reduce the flip-flop insertion overhead. Our
simulation focused on high-performance flip-flops: HLFF, SDFF, CCFF, and
LSDFF. Each flip-flop is optimized for power-delay product. The simulation
conditions were 1.5 V and 80°C with the clock frequency at 125 MHz
for LSDFF and 250 MHz for conventional single-edge triggered flip-flops to
achieve the same throughput. The output load capacitance was assumed to
be 100 fF. Comparisons of simulation results for the four flip-flops are
summarized in Table 6.2.

Figure 6.21(a) shows that LSDFF has the least power consumption when
the input pattern does not change, whereas HLFF and SDFF still incur high
power consumption even though the input stays “Hi.” For an average input
switching activity of 0.3, the power consumption of LSDFF is reduced by
28.6%~49.6% over conventional flip-flops as shown in Figure 6.21(a),
mainly due to halved clock frequency and the elimination of unnecessary
internal node transitions. The power-delay product is also reduced by
28.7%~47.8% with comparable delay. Clk-Q delay comparisons alone are
not suitable as a performance parameter because they do not consider the
setup time and, therefore, the effective time taken out of the clock cycle [25].
Hence, the D-Q delay is used as the delay parameter of a flip-flop. The setup
time of LSDFF is negative (-35 ps),
which is an important attribute of the soft-clock edge for time borrowing and
for overcoming clock skew problems. As shown in Figure 6.21(b), an
additional 78% power savings in clock network can be achieved by the
reduced clock-swing scheme and a 50% reduction in clock frequency.

6.4 MORE ON CLOCKING POWER-SAVING METHODOLOGIES

Clock gating, which is a very effective power-reduction technique for


inactive blocks, is explained in Section 6.4.1. A Flip-flop with logic
embedding abilities is described with simulation results in Section 6.4.2.
The tree type and the number of clock buffers inserted affect the clock skew
and power consumption significantly, as shown in Section 6.4.3. Solutions for
soft errors due to energetic alpha particles, and the sizing of the input-data
transistors, may become more important for reducing power consumption in
multi-GHz SoCs (Section 6.4.4).

6.4.1 Clock Gating

Clock gating is a key technique to reduce the power dissipation of the


inactive circuit blocks and their local clock buffers. For example, the
floating-point unit (FPU) in a microprocessor occupies a large area and
consumes considerable power, yet it can be inactive for some application programs.
Most functional units in a microprocessor are in use less than 50% of the time [34].
If a macro is inactive for a long time, the macro—including a local clock
buffer, a register, and a combinational block—should be turned off by a
clock gating function as shown in Figure 6.22.
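The behavior of such a clock-gating function can be sketched in a few lines. The following Python model is our own illustration, not a circuit from this chapter's figures; it assumes the common latch-based gating cell, in which the enable is sampled while the clock is low so that suppressing or releasing the clock never produces a truncated pulse.

```python
def gated_clock(clk_wave, enable_wave):
    """Behavioral model of a latch-based clock-gating cell: the enable
    is latched while CLK is low, so the gated clock never glitches."""
    gclk = []
    en_latched = 0
    for clk, en in zip(clk_wave, enable_wave):
        if clk == 0:               # latch is transparent while CLK is low
            en_latched = en
        gclk.append(clk & en_latched)
    return gclk

clk = [0, 1, 0, 1, 0, 1, 0, 1]
en  = [1, 1, 0, 0, 1, 1, 1, 1]     # block idle during the second cycle
print(gated_clock(clk, en))        # -> [0, 1, 0, 0, 0, 1, 0, 1]
```

The suppressed second pulse is exactly the saving described above: neither the gated clock line nor the flip-flops it feeds toggle during the idle cycle.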

This technique has been often used successfully in various SoCs to


reduce the power consumption in flip-flops and the clock network. In Figure
6.22, local clock buffer 1 is not clock-gated and always fires the clock signal
CLK1, which feeds n-bit flip-flops. Local clock buffer 2 is clock-gated,
and if the CG_sel signal is “Lo,” then CLK2 is not fired. Hence, power
consumption from CLK2 distribution and n-bit flip-flops can be neglected.
Furthermore, dynamic power consumption of the block (combinational block
B in Figure 6.22) following the clock-gated flip-flops can be saved during
the clock-gating period. A single circuit can be used as the clock-gating
function for a group of flip-flops. Additional skew from clock gating should
be minimized. Increased timing verification complexity is a drawback. To
control clock gating more precisely, each pipeline stage, each sub-macro or
even individual gates in a functional unit can be turned off whenever they
are not being used. This requires a sophisticated clock gating that may trade
power for performance. Furthermore, L di/dt noise and skew variations may
limit the sophisticated clock gating. Thorough analysis of efficient clock-gating methodologies has been performed by a group of researchers in the


last decade. Detailed introduction of those methods is, however, beyond the
scope of this chapter.

6.4.2 Embedded Logic in Flip-Flops

Simple logic elements can be embedded into LSDFF to reduce overall delays
within a pipeline stage. With embedded logic in LSDFF, the overall circuit
performance can be optimized by saving a gate in critical paths. Featuring
embedded logic inside the flip-flop will become more important in terms of
power and performance due to reduced cycle time and increased flip-flop
insertion overhead. Table 6.3 shows that the speedup factor of embedded
logic in LSDFF over discrete logic ranges from 1.33 to 1.49. SDFF can also
include a logic function inside the flip-flop more easily than LSDFF because
the input data feeds only one NMOS gate (input D), as shown in Figure
6.2(b). Hence, logic-embedded SDFF can improve the overall performance
significantly, as seen in [9].

6.4.3 Clock Buffer (Repeater) and Tree Design

Static clock skew should be minimized to reduce clocking overhead in high-


frequency pipelined data paths. The number of clock buffers and their sizes
as well as the inserted locations play an important role in reducing clock
skew and power consumption in the clock network. Besides small-swing
clocking and the double-edge triggering scheme, several authors have
suggested how to reduce power consumption in the clock tree via power-aware
clock buffer design [35][36][37].
Inserting the optimal number of clock buffers can minimize the short-circuit
power consumption of the repeaters with a tolerable skew increase [35]. Vittal
et al. propose a methodology that designs the clock tree topology and buffer
insertion concurrently [36]. The sizes of the final clock drivers are very large
and can consume 25% of the total chip power [37]. Reference [38] shows an
example of power reduction in the clock network of the Alpha 21264. Depending
on the skew target, a mesh-type tree, an H-tree, a serpentine tree, or a
combination can be chosen. Generally, a mesh-type tree consumes the largest
power with the least clock skew. The sizing of the metal width and the buffer

for clock tree can be optimized to reduce skew [39]. Further research is
needed to reduce power consumption.
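As a back-of-the-envelope illustration of the buffer-count trade-off, the classic first-order repeater-insertion formulas (a Bakoglu-style model, not a method taken from [35] or [39]) relate the delay-optimal buffer count to the wire and buffer parasitics. All parameter values below are invented for the example.

```python
import math

def repeater_plan(r_wire, c_wire, r0, c0):
    """Delay-optimal repeater count k and relative size h under the
    classic first-order (Bakoglu-style) model; illustration only.
    r_wire, c_wire: total wire resistance/capacitance;
    r0, c0: resistance/input capacitance of a minimum buffer."""
    k = math.sqrt(0.4 * r_wire * c_wire / (0.7 * r0 * c0))
    h = math.sqrt(r0 * c_wire / (r_wire * c0))
    return max(1, round(k)), h

# hypothetical 1 kOhm / 1 pF clock wire and 100 Ohm / 10 fF buffer
k, h = repeater_plan(1e3, 1e-12, 1e2, 1e-14)
print(k, round(h, 2))   # -> 24 3.16
```

Inserting fewer or smaller buffers than this delay-optimal plan reduces buffer power at the cost of some delay and skew, which is the kind of trade-off exploited in [35].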

6.4.4 Potential Issues in Multi-GHz SoCs in VDSM Technology

The small input capacitance of the flip-flop has become more important in
multi-GHz SoCs. Large input capacitance requires a bigger driver from the
previous combinational logic block, and if the driver gain is too big, the size
of the gate that feeds the driver should be increased as well. This ripple
effect may increase the size of a single pipeline stage by up to 50% because of
the reduced number of gates per pipeline stage in multi-GHz SoCs. The
50%-larger combinational logic block also consumes 50% more power.
As the process technology shrinks, more soft errors will occur in flip-
flops and other circuits because of energetic alpha particles emitted from
cosmic rays and chip packages. To reduce this soft-error rate, the
drain/source area in the feedback path should be increased [33], which in
turn will consume more power. Hence, power-aware soft-error minimization
techniques will become more important.

6.5 COMPARISON OF POWER-SAVING APPROACHES

The power saving approaches of several conventional flip-flops have been


described in the previous sections. In this section, the different approaches to
reducing the power consumption for various clocking schemes will be
summarized.
First, statistical power reduction flip-flops reduce the dynamic power
consumption of flip-flops by removing redundant internal node switching,
thus reducing the internal switching-activity term in equation (6.4). For
DLDFF and CODFF, the internal clock power term in equation (6.4) can be
reduced as well because the internal clock buffers are turned off when input
D and output Q are the same. Statistical power reduction flip-flops consume
less power when the input data switching activity
is low, which is very common in most cases. Second, small-swing clock flip-
flops (HSFF, RCSFF, and SCFF) reduce power consumption in the clock
network by reducing the clock voltage swing. The capacitance of the clocked
transistors of the flip-flop in equation (6.2) is also reduced in RCSFF
and SCFF. Third, the clock frequency of double-edge triggered flip-flops can
be reduced to half, which can, in turn, reduce clock network power by half.
The DETFF by Hossain et al. reduced the number of clocked transistors
compared to previous DETFFs, which saves power consumption in the clock
tree and the preceding combinational block. LSDFF uses both a low-swing
clock and a double-edge triggered operation to reduce power consumption in
the clock network. In addition, LSDFF does not have any redundant internal
node switching.
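These claims can be sanity-checked against a first-order CV²f model of clock power. The sketch below is our own back-of-the-envelope calculation, not from the cited works; it assumes dynamic power scales with the square of the clock swing, whereas real reduced-swing drivers deviate from this (hence figures such as the 63% reported for RCSFF).

```python
def clock_power(c_load, v_swing, f_clk):
    # First-order model: P_clk ~ C * Vswing^2 * f.  Assumption: energy
    # per transition scales with the swing squared; actual savings
    # depend on how the reduced swing is generated.
    return c_load * v_swing ** 2 * f_clk

base        = clock_power(1.0, 1.0, 1.0)
half_swing  = clock_power(1.0, 0.5, 1.0)   # small-swing clocking
double_edge = clock_power(1.0, 1.0, 0.5)   # DETFF: half clock frequency
combined    = clock_power(1.0, 0.5, 0.5)   # LSDFF-style: both at once
print(1 - half_swing/base, 1 - double_edge/base, 1 - combined/base)
# -> 0.75 0.5 0.875  (ideal figures; the chapter reports 78% measured
#    for the LSDFF clock network)
```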
Clock gating, a very powerful and widely used technique, can reduce
power consumption in the flip-flop, clock network, and the subsequent
combinational block. In addition, logic embedding inside the flip-flop can
reduce power consumption and the overall delay of each pipeline stage.
Finally, the serpentine clock tree consumes relatively little power compared
to the mesh tree and the H-tree, and the number of clock buffers inserted is a
trade-off between low power consumption and low clock skew. Table 6.4
summarizes power-saving approaches for each flip-flop type with terms in
equations (6.2) and (6.4). Short pulse generators can be shared among a
group of latch leaf cells to reduce power consumption at the cost of clock
signal distortion through propagation in DLDFF and LSDFF.

(Footnotes to Table 6.4: the DETFF entries are those by Gago et al., Hossain
et al., and Mishra et al.; NA = not available because an internal clock buffer
is not needed; one term denotes the power consumption of the combinational
block.)

6.6 SUMMARY

Flip-flop design plays an important role in reducing cycle time and power
consumption. This chapter focused on power-saving techniques in flip-flops
and the clock distribution network. To summarize, the following features
should be considered to reduce the clocking power of the chip.

Statistical power reduction


Small-swing clocking
Double-edge triggering
Clock gating
Logic embedding in flip-flop
Small input capacitance
Power-aware clock buffer insertion
Trade-off in clock tree topology
Single-phase clock

Four statistical power-reduction techniques are described to reduce the


power consumption of the flip-flop. The amount of flip-flop power saving
strongly depends on input data switching activity. Power saving from these
techniques is about 26-50% compared to their counterparts. CCFF and CPFF
do not incur any speed penalty except for the increased setup time.
Three small-swing clocking flip-flops are introduced to reduce the power
dissipation of the clock network. Ideally they can reduce the power
consumption by 67-75% in the clock network compared to full swing
clocking flip-flops. However, the speed degradation and limited noise
tolerance of small-swing clocking may restrict its application. Another
approach to clock tree power reduction is double-edge triggering. Three
DETFFs are shown, and they can theoretically save up to 50% of the clock
network power.
LSDFF avoids unnecessary internal node transitions to reduce power
consumption. In addition, power consumption in the clock tree is reduced
because LSDFF uses a double-edge triggered operation as well as a low-
swing clock. To prevent performance degradation of LSDFF due to low-
swing clock, low-Vt transistors are used for the clocked transistors without a
significant leakage current problem. The power saving in flip-flop operation
is estimated to be 28.6 to 49.6% with additional 78% power savings in the
clock network.

Clock gating has been used in several commercial microprocessors to


reduce the power consumption of idle blocks. This method can reduce power
significantly in both flip-flops and clock networks. Logic embedding
capability inside the flip-flop can reduce power and overall delay and is
becoming more important due to increased flip-flop insertion overhead in a
clock period. Inserting the optimal number of clock buffers can minimize the
short-circuit power consumption of the repeaters. Single-phase clocking
can save clock tree power consumption by 30% compared to two-phase
clocking [31].
The small input capacitance of the flip-flop has become more important
in multi-GHz SoCs. As the process technology shrinks, more soft errors will
occur in flip-flops because of energetic alpha particles. To reduce this soft-
error rate, the drain/source area in the feedback path should be increased,
which in turn will consume more power. Clock tree topology and the number
of clock buffers inserted should be optimized to reduce skew and power
consumption.

REFERENCES

[1] Semiconductor Industry Association, International Technology Roadmap for
Semiconductors, 2000 update.
[2] P. Gelsinger, “Microprocessors for the new millennium: challenges, opportunities, and
new frontiers,” in IEEE Int. Solid-State Circuits Conf., Feb. 2001, pp. 22-25.
[3] B. Pohlman, “Overcoming the Barriers to 10GHz Processors,” in Microprocessor
Forum, Oct. 2001.
[4] R. Bechade, R. Flaker, B.Kauffmann, A. Kenyon, C. London, S. Mahin, K. Nguyen, D.
Pham, A. Roberts, S. Ventrone, and T. Voreyn, “A 32 b 66 MHz 1.8 W microprocessor,”
in IEEE Int. Solid-State Circuits Conf., Feb. 1994, pp. 208-209.
[5] H. Kojima, S. Tanaka, Y. Okada, T. Hikage, F. Nakazawa, H. Matsushige, H. Miyasaka,
and S. Hanamura, “A multi-cycle operational signal processing core for an adaptive
equalizer,” VLSI Signal Process VI, pp. 150-158, Oct. 1993.
[6] P. Gronowski, W. Bowhill, R. Preston, M. Gowan, and R. Allmon, “High-performance
microprocessor design, ” IEEE J. Solid-State Circuits, vol. 33, no. 5, pp. 676-686, May
1998.
[7] C. J. Anderson, et al., “Physical design of a fourth-generation POWER GHz
microprocessor,” in IEEE Int. Solid-State Circuits Conf., Feb. 2001, pp. 232-233.
[8] E. Partovi, R. Burd, U. Salim, F. Weber, L. DiGregorio, and D. Draper, “Flow-through
latch and edge-triggered flip-flop hybrid elements,” in IEEE Int. Solid-State Circuits
Conf., Feb. 1996, pp. 138-139.
[9] F. Klass, “Semi-dynamic and dynamic flip-flops with embedded Logic,” in Symp. on
VLSI Circuits Digest of Technical Papers, Jun. 1998, pp. 108-109.
[10] M. Nogawa and Y. Ohtomo, “A data-transition look-ahead DFF circuit for statistical
reduction in power consumption,” IEEE J. Solid-State Circuits, vol. 33, no. 5, pp. 702-
706, May 1998.

[11] M. Hamada, T. Terazawa, T. Higashi, S. Kitabayashi, S. Mita, Y. Watanabe, M. Ashino,


H. Hara, and T. Kuroda, “Flip-flop selection technique for power-delay trade-off,” in
IEEE Int. Solid-State Circuits Conf., Feb. 1999, pp. 270-271.
[12] B. Kong, S.-S. Kim, and Y.-H. Jun, “Conditional-capture flip-flop technique for
statistical power reduction,” in IEEE Int. Solid-State Circuits Conf., Feb. 2000, pp. 290-
291.
[13] N. Nedovic, and V. G. Oklobdzija, “Dynamic flip-flop with improved power,” in Proc.
IEEE Int. Conf. Computer Design, Sep. 2000, pp. 323-326.
[14] H. Kojima, S. Tanaka, and K. Sasaki, “Half-swing clocking scheme for 75% power
saving in clocking circuitry,” IEEE J. Solid-State Circuits, vol. 30, no. 4, pp. 432-435,
April 1995.
[15] H. Kawaguchi and T. Sakurai, “A reduced clock-swing flip-flop (RCSFF) for 63% clock
power reduction,” IEEE J. Solid-State Circuits, vol. 33, no. 5, pp. 807-811, May 1998.
[16] Y.-S. Kwon, I.-C. Park, and C.-M. Kyung, “A new single clock flip-flop for half-swing
clocking,” IEICE Trans. Fundamentals, vol. E82-A, no. 11, pp. 2521-2526, Nov. 1999.
[17] Gago, R. Escano, and J. Hidalgo, “Reduced Implementation of D-Type DET Flip-
Flops, ”IEEE J. Solid-State Circuits, vol. 28, no. 3, pp. 400-402, Mar. 1993.
[18] R. Hossain, L. Wronski, and A. Albicki, “Low-power design using double edge triggered
flip-flop,” IEEE Tran. VLSI Syst., vol. 2, no. 2, pp. 261-265, Jun. 1994.
[19] S. Mishra, K. S. Yeo, and S. Rofail, “Altering transistor positions impact on the
performance and power dissipation of dynamic latches and flip-Flops,” IEE Proc.
Circuits, Devices and Syst., vol. 146, no. 5, pp. 279-284, Oct. 1999.
[20] J. Tschanz, S. Narendra, Z. Chen, S. Borkar, M. Sachdev, and V. De, “Comparative delay
and energy of single edge-triggered & dual edge-triggered pulsed flip-flops for high-
performance microprocessors,” in IEEE Int. Symp. Low-Power Electronics and Design,
Aug. 2001, pp. 147-152.
[21] R. Heald, et al., “A third-generation SPARC V9 64-b microprocessor,” IEEE J. Solid-State
Circuits, vol. 35, no. 11, pp. 1526-1538, Nov. 2000.
[22] Kowalczyk, et al., “The First MAJC microprocessor: a dual cpu system-on-a-chip,”
IEEE J. Solid-State Circuits, vol. 36, no. 11, pp. 1609-1616, Nov. 2001.
[23] Gerosa, “A 2.2 W, 80 MHz superscalar RISC microprocessor,” IEEE J. Solid-State
Circuits, vol. 29, no. 12, pp. 1140-1454, Dec. 1994.
[24] Y. Suzuki, K. Odagawa, and T. Abe, “Clocked CMOS calculator circuitry,” IEEE J.
Solid-State Circuits, vol. 8, no. 6, pp. 462-469, Dec. 1973.
[25] V. Stojanovic and V. G. Oklobdzija, “Comparative analysis of master-slave latches and
flip-flops for high-performance and low-power systems,” IEEE J. Solid-State Circuits,
vol. 34, no. 4, pp. 536-548, April 1999.
[26] S. Heo, R. Krashinsky, and K. Asanovic, “Activity-sensitive flip-flop and latch selection
for reduced energy,” in 2001 Conf. Advanced Research in VLSI, pp. 59-74.
[27] N. Nedovic and V. G. Oklobdzija, “Hybrid latch flip-flop with improved power
efficiency,” in Proc. IEEE Symp. Integrated Circuits and System Design, Sep. 2000, pp.
211-215.
[28] N. Nedovic, M. Aleksic and V. G. Oklobdzija, “Conditional techniques for low power
consumption flip-flops,” in Proc. IEEE Int. Conf. Electronics, Circuits and Systems, Sep.
2001, pp. 803-806.
[29] N. Nedovic, M. Aleksic and V. G. Oklobdzija, “Timing characterization of dual-edge
triggered flip-flops,” in Proc. IEEE Int. Conf. Computer Design, Sep. 2001, pp. 538-541.
[30] C. Kim, and S.-M. Kang, “A low-swing clock double edge-triggered flip-flop,” in Proc.
IEEE Symp. VLSI Circuits Dig. Tech. Papers, Jun. 2001, pp. 183-186.

[31] D. Markovic, B. Nikolic, and R. Brodersen, “Analysis and design of low-energy flip-flops,”
in IEEE Int. Symp. Low-Power Electronics and Design, Aug. 2001, pp. 52-55.
[32] J.-S. Wang, P.-H. Yang, and D. Sheng, “Design of a 3-V 300-MHz low-power 8-b × 8-b
pipelined multiplier using pulse-triggered TSPC flip-flops,” IEEE J. Solid-State Circuits,
vol. 35, no. 4, pp. 583-592, Apr. 2000.
[33] T. Karnik, B. Bloechel, K. Soumyanath, V. De, and S. Borkar, “Scaling trends of cosmic
rays induced soft errors in static latches beyond ” in Proc. IEEE Symp. VLSI
Circuits Dig. Tech. Papers, Jun. 2001, pp. 61-62.
[34] D. Brooks, P. Bose, S. Schuster, H. Jacobson, P. Kudva, A. Buyuktosunoglu, V. Zyuban,
M. Gupta, and P. Gook, “Power-aware microarchitecture: design and modeling
challenges for next-generation microprocessors,” in IEEE Micro, vol. 20, no. 6, pp. 26-
44, Nov.-Dec. 2000.
[35] V. Adler and E. G. Friedman, “Repeater design to reduce delay and power in resistive
interconnect,” in Proc. IEEE Int. Symp. Circuits and Systems, May 1997, pp. 2148-2151.
[36] Vittal, and M. Marek-Sadowska, “Low-power buffered clock tree design,” IEEE Trans.
Computer-Aided Design, vol. 16, no. 9, pp. 965-975, Sep. 1997.
[37] P. Gronowski, “Designing high performance microprocessor,” in Proc. IEEE Symp. VLSI
Circuits Dig. Tech. Papers, Jun. 1997, pp. 51-54.
[38] M. Gowan, L. Biro, and D. Jackson, “Power considerations in the design of the Alpha
21264 Microprocessor,” in Proc. Design Automation Conf., June 1998, pp. 726-731.
[39] C. Chu and D. F. Wong, “An efficient and optimal algorithm for simultaneous buffer and
wire sizing,” IEEE Trans. Computer-Aided Design, vol. 18, no. 9, pp. 1297-1304, Sep.
1999.
Chapter 7
Power Optimization by Datapath Width Adjustment

Hiroto Yasuura1 and Hiroyuki Tomiyama2


1 System LSI Research Center, Kyushu University; 2 Institute of Systems and Information
Technologies/Kyushu

Abstract: Datapath width is an important design parameter for power optimization. The
datapath width significantly affects the area and power consumption of
processors, memories, and circuits. By analyzing required bit-width of
variables, the datapath width can be optimized for power minimization.
Several concepts and techniques of power minimization by datapath-width
adjustment are summarized.

Keywords: Datapath width, bit-width analysis, soft-core processor, Valen-C, compiler,


memory size, dynamic datapath-width adjustment, quality-driven design,
computation accuracy, signal processing.

7.1 INTRODUCTION

Since the datapath width, i.e., the bit width of the buses and operational units
in a system, strongly affects the size of its circuits and memories, the power
consumption of a system also depends on the width of its datapath.
In hardware design, designers are very sensitive to the width of the
datapath. Analyzing requirements on the datapath width carefully, designers
determine the length of registers and the datapath width to minimize chip
area and power consumption. In processor-based system design, it is
difficult for programmers to change the datapath width for each program. A
system designer determines the datapath width of the system when he/she
chooses a processor. On the other hand, each application requires a different
accuracy of computation, dictated by the specifications of the input/output
signals and algorithms, and the required datapath width sometimes differs
from the width of the processor's datapath.

Table 7.1 shows the bit width of each variable in an MPEG-2 video
decoder program [1]. The program is written in over 6,000 lines of C, and
384 variables are declared as type int. Fifty variables are used as flags, and
only 1 bit is required for each of them during the computation. Only 35% of
the total bits of these 384 variables are actually used in the computation, and
65% are useless.

When a large datapath width is provided for a computation that requires only
a small bit width, wasteful dynamic power is consumed by meaningless
switching on the extra bits of the datapath. Furthermore, these extra bits
introduce extra leakage power consumption, which will become even more
significant in advanced fabrication technologies.
This chapter introduces several system-level design techniques to reduce
the wasteful power consumption by useless bits in a datapath. The basic
approach is datapath width adjustment. First, bit-width analysis is performed
to extract information on the required bit width of variables in programs and
algorithms. For hardware design, using the result of bit-width analysis, one
can determine the length of registers, the size of operation units, and the
width of memory words on the datapath of a system to minimize the
meaningless power consumption by the useless bits. For processor-based
systems, a soft-core processor with datapath width flexibility is useful. Bit-
width analysis is applied to programs running on the processor, and the
datapath width of the soft-core processor is determined. The trade-off
between power consumption and execution time needs to be resolved.
Choosing the optimum datapath width minimizes the power consumption of

a system that consists of processors and memories. Software support, such
as compilers and operating systems, is also important for the datapath
adjustment of processor-based systems. Case studies show the possibilities
of power reduction by datapath adjustment.

7.2 POWER CONSUMPTION AND DATAPATH WIDTH

This section shows the relationship between datapath width and power
consumption. Datapath width directly affects the power consumption of
buses, operation units (such as adders, ALUs, and multipliers), and registers.
It is also related to the size of data and instruction memories of processor-
based systems.
The relations between datapath width and power consumption are
summarized as follows:
1. Shorter registers and operation units reduce switching count and the
leakage current of extra bits on the datapath.
2. Smaller circuit size induces smaller capacitance of each wire.
3. Datapath width is closely related to the size of data and instruction
memories of processor-based systems. The relationship is not
monotonic.

7.2.1 Datapath Width and Area

Datapath width is directly related to the area of datapath and memories. The
area of circuits and memories is also closely related to power consumption
because of load capacitance.
Assume a processor-based system, the datapath width of which can be
changed by system designers. As the datapath width is reduced, the area and
power consumption of the processor decrease almost linearly because of the
reduction of the size of registers, buses, and operation units. The size of the
memory, which also strongly affects the power consumption of the system,
is changed drastically by the selection of the datapath width.
Generally, narrowing the datapath width reduces the area and power of
the processor, but degrades the performance. The number of execution
cycles increases, since some single-precision operations should be replaced
with double or more precision operations in order to preserve the accuracy of
the computation. Single-precision operations are those whose precision does
not exceed the datapath width. For example, an addition of two 32-bit data is
a single-precision operation on processors whose datapath width is equal to
or greater than 32 bits, while it is a double-precision operation on
16-bit processors.
Changing the datapath width affects the size of data memory (RAM) and
instruction memory, which is mostly implemented by ROM in embedded
systems. Let us consider a program including two variables x and y, and
assume that two variables x and y require at most 18 bits and 26 bits,
respectively (see Figure 7.1). When the datapath width is 32 bits, two words
are required to store these two variables, and the amount of the data memory
is 64 bits. Since the minimum bit size required to store the variables is only
44 bits (18 + 26), 20 bits of the memory (about 30%) are unused. By
reducing the datapath width to 26 bits, one can reduce the unused bits to 8
bits. Unused bits, however, increase to 31 bits, if a 25-bit datapath is
adopted, because y requires two words. When the datapath width is 9 bits,
two words and three words are required for x and y, respectively, and the
unused area is only 1 bit. As shown in Figure 7.1, RAM size does not
decrease monotonically with the reduction of the datapath width [2]. Many
unused bits in the data memory can be eliminated by datapath-width
optimization.
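The Figure 7.1 arithmetic is easy to reproduce. The short Python sketch below (our own illustration) packs each variable into whole memory words of the given width and counts the unused bits:

```python
import math

def ram_bits(var_widths, datapath_width):
    # each variable occupies an integer number of memory words
    words = sum(math.ceil(b / datapath_width) for b in var_widths)
    return words * datapath_width

def unused_bits(var_widths, datapath_width):
    return ram_bits(var_widths, datapath_width) - sum(var_widths)

x_and_y = (18, 26)                      # the variables of Figure 7.1
assert unused_bits(x_and_y, 32) == 20   # about 30% of 64 bits wasted
assert unused_bits(x_and_y, 26) == 8
assert unused_bits(x_and_y, 25) == 31   # y now needs two words
assert unused_bits(x_and_y, 9) == 1
```

The non-monotonic dip at 26 bits and the jump at 25 bits follow directly from the ceiling in the word count.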
For the instruction memory, the total memory size is calculated by
multiplying the instruction word length by the number of instructions stored.
When the datapath width is reduced, the number of instructions increases
because of the increase of multiple precision operations. For example, an
addition of 20-bit data is executed by one instruction on a processor with a
20-bit datapath. When the datapath width is 10 bits, two instructions, which
are additions of lower (less significant) 10 bits and higher (more significant)
10 bits, are required (see Figure 7.2). Furthermore, LOAD and STORE
instructions may be additionally required because of the shortage of

registers. The size of instruction memory grows monotonically as the


datapath width narrows, if the instruction word length does not change.

7.2.2 Energy Consumption and Datapath Width

When the datapath width is reduced, the multiple precision operations


increase. For example, the following addition is a single-precision operation
on the n-bit processor.

int x, y, z; /* n bits */
z = x + y;

If the datapath width is m bits (m < n), the above addition is
translated into the following two additions of the lower m bits and the
higher m bits.

int x, y, z; /* n bits */
z_low  = x_low + y_low;            /* m bits */
z_high = x_high + y_high + carry;  /* m bits */
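For concreteness, the split can be exercised in software. The Python sketch below is our own illustration; it assumes unsigned values and, for simplicity, n = 2m, so the n-bit sum is reproduced by two m-bit additions plus the carry between them.

```python
def add_split(x, y, n, m):
    """Add two n-bit unsigned values using only m-bit additions,
    mirroring the lower/higher split above (assumes n == 2*m)."""
    mask = (1 << m) - 1
    lo = (x & mask) + (y & mask)          # lower m bits
    carry = lo >> m
    hi = (x >> m) + (y >> m) + carry      # higher m bits plus carry
    return ((hi << m) | (lo & mask)) & ((1 << n) - 1)

assert add_split(0x3A5F, 0x1234, n=16, m=8) == (0x3A5F + 0x1234) & 0xFFFF
assert add_split(0x00FF, 0x0001, n=16, m=8) == 0x0100   # carry propagates
```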

As shown above, if the datapath width is narrower than n bits, extra


instruction cycles are needed to preserve the accuracy of the computation.
This means that the datapath is used more than twice, and the total energy
consumption may increase even if one can reduce the power consumption of
a single instruction cycle. The total energy consumption for a given

computation needs to be discussed, not the maximum or average power


consumption for a single instruction cycle.
Energy consumption changes in a non-monotonic manner with an
increase in datapath width. In general, if the datapath is too narrow, energy is
increased because of the increased execution cycles; a penalty is paid in
extra instruction fetches and control. On the contrary, if the datapath is
too wide, energy is also increased, due to wasteful switching and leakage
current on the extra datapath bits, which do not contribute to the
computation.
Now, an energy minimization problem by datapath-width adjustment can
be defined.
For a given program and a set of input data, determine the datapath
width of a system consisting of a processor and memories such that the
energy consumption for the execution of the program with the data set is
minimized.
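A toy cost model already exhibits this non-monotonic behavior. Everything below is a hypothetical illustration (the per-cycle overhead and per-bit energy coefficients are invented), reusing the 18-bit and 26-bit variables of Figure 7.1:

```python
import math

def energy(width, var_bits=(18, 26), e_bit=1.0, overhead=4.0):
    """Hypothetical energy model, illustration only: an operation on a
    b-bit variable takes ceil(b/width) cycles; each cycle costs a fixed
    fetch/control overhead plus e_bit per datapath bit."""
    cycles = sum(math.ceil(b / width) for b in var_bits)
    return cycles * (overhead + e_bit * width)

best = min(range(4, 33), key=energy)
assert best == 26                   # neither the narrowest nor the widest
assert energy(26) < energy(8) and energy(26) < energy(32)
```

With these made-up coefficients the minimum falls at a 26-bit datapath: wider widths waste switching energy on unused bits, while narrower widths pay for extra multi-precision cycles.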

7.2.3 Dynamic Adjustment of Datapath Width

Processor-based systems treat various data with different bit width. It is


efficient in power reduction to control the active datapath width
dynamically. A simple technique based on this approach is proposed in [3]:
prepare several instruction sets for the processor, where each instruction set
treats data with a fixed bit width different from the others. For example,
suppose that two instruction sets, for 8 bits and 32 bits, are prepared, and
operations on data whose bit width is less than or equal to 8 use instructions
from the 8-bit instruction set. On the datapath, buses, and operation units,
the higher 24 bits are unchanged during 8-bit operations.
Consider the following three data transferred on a bus consecutively.

A 1100 1101 0010 1010 0000 0101 0011 0001
B 0000 0000 0000 0000 0000 0000 1011 0101
C 0101 1001 0000 1110 0110 0111 1010 1110

In an ordinary datapath, the total switching count is 28. If the two


instruction sets mentioned above are used, the data sequence will be as
follows:

A 1100 1101 0010 1010 0000 0101 0011 0001
B 1100 1101 0010 1010 0000 0101 1011 0101
C 0101 1001 0000 1110 0110 0111 1010 1110

The total switching count is reduced to 14. This approach requires an
extra bit in the instruction format to distinguish the two instruction sets,
but it offers the possibility of a large power reduction.
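The switching counts above can be verified mechanically. The sketch below (our own) XORs consecutive bus words and counts the flipped bits; for the 8-bit transfer of B, the upper 24 bits of the bus simply retain their previous value:

```python
def toggles(seq):
    # total bit flips between consecutive words on a bus
    return sum(bin(a ^ b).count("1") for a, b in zip(seq, seq[1:]))

A = 0xCD2A0531                      # the three words from the text
B = 0x000000B5
C = 0x590E67AE

assert toggles([A, B, C]) == 28     # full 32-bit transfers

# With the 8-bit instruction set, only the low byte of B is driven;
# the upper 24 bus bits keep their previous (A) value.
B8 = (A & ~0xFF) | (B & 0xFF)
assert toggles([A, B8, C]) == 14
```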
For the Unix programs SPLIT and SORT, power consumption is reduced by
16% and 28%, respectively, by introducing an 8-bit instruction set into a
32-bit RISC processor [3]. The gate-count increase of the instruction-set
extension is less than 10%. The assignment of instructions from the two
instruction sets can be performed at compile time.
There are many different approaches to dynamic datapath-width
adjustment; further architectural techniques are expected to be proposed.

7.3 BIT-WIDTH ANALYSIS

To adjust the datapath width for power reduction, information on the bit-
width requirement of each variable is very important. Popular programming
languages, however, have no feature to treat detailed information on the bit
width of variables. System designers and programmers are not concerned
about the size of variables except for the selection of data types such as int,
short, and char in the C language.
In the design phase of algorithms and programs, designers want to
concentrate their attention on the design of system functionality. Information
on the bit width of variables has low priority, though it is very useful for
power optimization. It is desirable for the bit width of each variable to be
automatically inferred from the descriptions of algorithms and programs.
The bit-width analysis is defined as follows:
For a given program, a set of input data, and requirements on computation
accuracy (e.g., quality of output), find the bit width of every variable in the
program required to keep sufficient information during the computation for
the input data set while satisfying the accuracy requirement.
Several bit-width analysis techniques have been developed [4][5][6][7].
Using the techniques, an ordinary program is automatically analyzed, and
the bit width of each variable required for computation is specified in the
program. Thus, programmers do not have to care about the variable bit
width.
There exist two approaches to analyzing the variable bit width [4]. One is
dynamic analysis, in which one executes the program and monitors the value
of each variable. The other is static analysis, in which the variable bit width
is analyzed by formal rules without executing the program.
In the static analysis, rules to compute the ranges of variables after
executing basic operations are prepared. The analysis is performed both
forward and backward [7]. In the forward analysis, for an assignment
188 Power Optimization by Datapath Width Adjustment

statement with arithmetic operations, the range of the variable on the left side
is calculated from the ranges of the variables and constants on the right side,
according to the rules for the operations involved. Starting from the ranges of
input data, one can calculate the range of every variable in the program by a
technique of symbolic simulation. The backward analysis proceeds similarly,
starting from the ranges of the outputs.
For example, consider the following addition statement.

z = x + y

If the ranges of x and y are [0, 2000] and [30, 500], respectively, the
range of z is [30, 2500]. Thus, 12 bits are required for variable z.
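As a minimal sketch of this forward rule (in C, with illustrative names not taken from the cited tools), interval addition and the bit-width computation above can be written as:

```c
#include <assert.h>

/* Value range [lo, hi] of a variable, for static analysis. */
typedef struct { long lo, hi; } range_t;

/* Forward rule for z = x + y: interval addition. */
range_t range_add(range_t x, range_t y) {
    range_t z = { x.lo + y.lo, x.hi + y.hi };
    return z;
}

/* Minimum bits needed to hold any unsigned value in [0, hi]. */
int bits_needed(long hi) {
    int bits = 1;
    while ((1L << bits) <= hi)
        bits++;
    return bits;
}
```

For x in [0, 2000] and y in [30, 500], range_add yields [30, 2500], and bits_needed(2500) returns 12, matching the example.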
Static analysis is an efficient method to analyze the variable bit width.
However, in many cases when the assigned value of a variable cannot be
predicted unless the program is executed, such as in the case of unbounded
loops, static analysis is insufficient. As a solution to this problem, the
dynamic analysis is used in combination with the static analysis.
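A dynamic analysis can be sketched as instrumentation that records the peak value each variable takes during a run; the helper names below are hypothetical, not from the cited tools:

```c
#include <assert.h>

/* Per-variable monitor for dynamic bit-width analysis. */
typedef struct { long max_seen; } monitor_t;

/* Instrumented assignment: record the value and pass it through. */
long observe(monitor_t *m, long v) {
    if (v > m->max_seen)
        m->max_seen = v;
    return v;
}

/* Bits required for the largest value observed so far. */
int observed_bits(const monitor_t *m) {
    int bits = 1;
    while ((1L << bits) <= m->max_seen)
        bits++;
    return bits;
}

/* Example run: monitor a variable through a loop whose peak
   value (2475) a simple static rule could not predict. */
int demo_bits(void) {
    monitor_t m = { 0 };
    for (long i = 0; i < 100; i++)
        observe(&m, i * 25);
    return observed_bits(&m);
}
```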

7.4 DATAPATH WIDTH ADJUSTMENT ON A SOFT-CORE PROCESSOR

Traditionally, processor cores in processor-based embedded systems are
widely used as hard macros. The layout data of a processor core is copied
directly into the system-on-a-chip design; such a core is called a
hard-core processor. In the hard-core processor approach, it is difficult to
modify the function and structure of the processor itself. The datapath width
of a processor-based system is determined by the datapath width of the core
processor, and it is difficult to adjust the datapath width for each application.
To increase the design flexibility, the approach of parameterization of
core processors can be considered. The items on parameterization will be the
number of registers, the word length of data and/or instructions, the number
of operation units, and so on. Parameterization of the instruction set and
of the functions of the processor has also been discussed. The
parameterized processor is called a soft-core processor, in which the function
and structure of the processor can be changed at the RTL description level.
In this section, an example of soft-core processors is presented, the
datapath width of which is parameterized and can be adjusted for each
application [8]. The soft-core processor is a core processor, which can be
redesigned by system designers. The soft-core processor is presented in the
forms of a fabricated chip, layout data, net list in logic circuit level, and RTL
description in HDL. Design modification is done mainly in the HDL
description rather than in net list or layout levels. A customized processor is
obtained through the redesign process, utilizing synthesis tools. Design
parameters for tools of logic and layout synthesis are also provided for
prompt re-synthesis after the modification.
As an implementation of the soft-core processor, Bung-DLX was designed
based on the DLX RISC architecture [9]. The original Bung-DLX has a non-
pipelined RISC architecture with 32 general-purpose registers and 72
instructions. The length of data and instruction words is 32 bits. The
address spaces of data memory and instruction memory are both 2^32. It is
described in about 7,000 lines of VHDL code. The gate size after logic
synthesis is 23,282 gates. The modifiable design parameters include the
width of the datapath, the amount of data memory, the length of the
instruction word, the amount of instruction memory, the number of
registers, and the instruction set itself.
Bung-DLX is now provided in the form of a VHDL description together
with a simulator, an assembler, and a compiler. The design flow of an SOC
using the soft-core processor is summarized in Figure 7.3.
In order to adjust the datapath width for a given application program, an
extended C language called Valen-C (Variable-Length C) was developed
[10]. Valen-C enables system designers to explicitly specify the required bit
width of each variable in the program. Even if system designers customize
the datapath width for their application, the Valen-C compiler preserves the
semantics and accuracy of the computation. Therefore, Valen-C programs
can be reused on processors with various datapath widths. Valen-C is one
solution for the problem of variable bit width support in C. The control
structures in Valen-C, such as “if” and “while” statements, are the same as in
C.
C provides for three integer sizes, declared using the keywords short, int,
and long. The sizes of these integer types are determined by a compiler
designer. In many processors, the size of short is 16 bits, int is 16 or 32 bits,
and long is 32 bits. On the other hand, in Valen-C, programmers can use
more data types. For example, if a variable x needs a bit width of 11 bits, x
is declared as "int11 x;". Using this notation, programmers can describe
information on the bit width of variables explicitly in Valen-C
programs. A bit-width analysis tool converts a C program into a Valen-C
program [4].
If processor architecture is modified, the compiler for the processor also
needs to be modified. For datapath-width adjustment, a retargetable compiler
that can treat any datapath width is required. In cases where the bit width of
an operation is larger than the datapath width, the compiler has to translate
the operation into a certain number of machine instructions for a multiple
precision operation. The Valen-C compiler takes a Valen-C program and
values of parameters of Bung-DLX as inputs and generates assembly code
for the modified processor.
Figure 7.4 shows an example of the compilation of a Valen-C program on
a 10-bit processor. The example assumes that the sizes of short, int, long,
and long long are 5 bits, 10 bits, 20 bits, and 30 bits, respectively. In the
Valen-C to C translation phase, the variable flag is mapped to short since
its bit width fits in 5 bits. The int type is used for x and y because their
bit widths fit in 10 bits but not in 5 bits. The long type is
assigned to z and w. In the code generation phase, the equation z = x + y is
performed by two addition instructions of lower 5 bits and higher 5 bits with
a carry bit. Similarly, the assignment z = w is also divided into two
instructions.
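This multiple-precision translation can be illustrated in C: a value stored as two narrow words is added with an explicit carry, mirroring the two machine instructions described above. The 5-bit word size and function names are illustrative, not compiler output.

```c
#include <assert.h>

#define WORD_BITS 5
#define WORD_MASK ((1u << WORD_BITS) - 1u)

/* Add two 10-bit values, each held as {low, high} 5-bit words,
   using two narrow additions linked by a carry bit. */
void add2w(const unsigned a[2], const unsigned b[2], unsigned z[2]) {
    unsigned lo = a[0] + b[0];                /* low-word addition */
    unsigned carry = lo >> WORD_BITS;         /* carry out of low word */
    z[0] = lo & WORD_MASK;
    z[1] = (a[1] + b[1] + carry) & WORD_MASK; /* high word plus carry */
}
```

For example, 350 (= {30, 10}) plus 250 (= {26, 7}) produces a carry out of the low word and yields 600 (= {24, 18}).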
In the rest of this section, three examples of datapath width adjustment
using Bung-DLX and Valen-C are presented. The following three
applications are used: a 12-digit decimal calculator (Figure 7.5), a Lempel-
Ziv encoder/decoder (Figure 7.6), and an ADPCM encoder (Figure 7.7). For
each of the three applications, performance (in terms of execution cycles),
chip area (including CPU, data RAM and instruction ROM), and energy
consumption while varying datapath width were estimated.
Performance, cost, and energy largely depend on the datapath width.
From the estimation results, one can find the optimal solution of the datapath
width. Note that each application has different characteristics on
performance, area, and energy. Thus, the optimal datapath width varies
depending on the applications and requirements.

7.5 CASE STUDIES

7.5.1 ADPCM Decoder LSI

Two ADPCM decoder ASICs were designed. The designs are direct
hardware implementations of an ADPCM decoder, not including processor
cores. The design started from an ADPCM decoder program written in C,
which is a part of the DSPstone benchmark suite. Next, the required bit
width of variables in the program was statically analyzed. The analysis
results are shown in Table 7.2. There are eight int-type variables in the
program, all of which are 32 bits in the original. The results show that no
variable requires a precision of 32 bits or more; the largest variable
needs only 18 bits.

Based on the results, two ASICs for the ADPCM decoder were designed.
One has a 32-bit datapath (ADPCM32), and the other has an 18-bit one
(ADPCM18). Since no high-level synthesis tool was available, we manually
designed the ASICs in VHDL. Then, logic synthesis was performed with
Synopsys Design Compiler and a 0.5µm standard-cell technology. The
synthesis results are summarized in Table 7.3. With datapath-width
adjustment, chip area and energy consumption were significantly reduced
(by 49% and 35%, respectively). Figure 7.8 shows photos of the chips
designed. The result shows that information obtained by bit-width analysis is
very important, and the effect of datapath adjustment in hardware design is
significant.

7.5.2 MPEG-2 AAC Decoder

The second example is an MPEG-2 AAC decoder, an audio decoder based
on the ISO/IEC 13818-7 standard. A practical C program used in consumer
products was supplied by a company. The size of the program is 8,575
lines, and all variables are defined as integers. In this case, Bung-DLX
and Valen-C are used for the implementation. First, the AAC decoder
program is analyzed using bit-width analysis tools, both dynamically and
statically. The analysis results are shown in Table 7.4. There are 133
int-type variables and 5 int-type arrays comprising 10,498 words, all of
which are 32 bits in the original. Our bit-width analysis result shows
that no variable requires a precision of 24 bits or more.
Based on the results, two AAC decoder chips were designed. One is
implemented with the original 32-bit Bung-DLX processor and the other with a
24-bit Bung-DLX. The C program was translated to Valen-C using the results
of the bit-width analysis. Design was done based on the design flow in Figure
7.3. Logic synthesis of Bung-DLX was performed with Synopsys Design
Compiler and standard cell technology. The synthesis results are
summarized in Table 7.5. With datapath-width adjustment, chip area and
energy consumption were reduced by 27% and 10%, respectively.

7.5.3 MPEG-2 Video Decoder Processors

In the third case study, an MPEG-2 video decoder was examined [1]. The
MPEG-2 decoder program was obtained from the MPEG Software
Simulation Group. In this design, Bung-DLX and Valen-C were also used.
The original program consists of over 6,000 lines of C code. We analyzed
the required bit widths of 384 int-type variables, and the results are
summarized in Table 7.1. Based on the results, we translated the C program
into a Valen-C one.
The datapath width of the processor was varied from 17 bits to 40 bits,
and the performance (in terms of execution cycles), gate count, and energy
consumption were estimated. The results are depicted in Figure 7.9. From the
figure, one can see that the chip area increases in a monotonic fashion with
the datapath width. Execution cycles are minimized at a 28-bit datapath and
do not decrease further for larger bit widths. Note that smaller datapath
widths have shorter critical-path delays. This means that, in the MPEG-2
example, performance is maximized at a 28-bit datapath. Energy
consumption is also minimized at 28 bits. For datapaths narrower than 28 bits,
more energy is required because of the larger number of execution cycles. On
the other hand, for datapaths wider than 28 bits, wasteful switching on the
datapath increases, and extra energy is consumed.

7.6 QUALITY-DRIVEN DESIGN

For SOC design, the bit width of data computed in a system is one of the
most important design parameters related to performance, power, and cost of
the system. The datapath width and size of memories strongly depend on the
bit width of the data. System designers often spend much time analyzing the
bit width of data required in the computation of a system. Hardware
designers of portable multimedia devices reduce the datapath width [11].
Programmers of embedded systems sometimes work hard to adjust the bit
widths of variables while keeping the accuracy of computation. By
controlling the datapath width, one can reduce area and power consumption
drastically. Furthermore, one can choose the computation precision actually
required for each application to further optimize application-specific design.
In video processing, for instance, the required qualities of video, such as
resolution and levels of color, strongly depend on the characteristics of
output display devices. One can reduce the computation precision in a target
application program if the reduction does not induce a decrease of output
quality. This means that a video system with the minimum hardware and
energy consumption can be designed by eliminating redundant computation.
This design methodology is called Quality-driven Design (QDD) [12].

Figure 7.10 shows the flow of the presented QDD for video decoders. In
the first phase of a system design, the functionality of the system is
implemented and optimized for general constraints on performance, power,
and cost. Initial designs are written in a high-level language,
such as C, in which most variables are assumed to be 32 bits. After the
function design is validated and verified, the second phase for application-
specific optimization is performed. In this phase, the bit widths of variables
in the application program are analyzed, the design parameters are tuned, the
output quality and computation precision are adapted, and datapath-width
adjustment is performed under the given quality constraint. Using QDD, one
can design various video applications with different video quality from the
same basic algorithm.
In QDD, both higher and lower bits of data can be reduced. Depending on
the requirements on the output quality, lower bits of data may be omitted in
the datapath-width adjustment (see Figure 7.11). This means that there is
potential for further energy reduction by decreasing computation accuracy.
The computation accuracy of a signal-processing program is sometimes
over-specified from the viewpoint of the performance of an output device.
One can reduce energy consumption for small displays and cheap speakers.
Using QDD, one can design different systems under given quality
constraints. This approach is very effective and promising because it can
eliminate many redundancies, which results in a drastic reduction of power
consumption and hardware area.

7.7 SUMMARY

This chapter presented methodologies, techniques, and tools for
datapath-width adjustment for power and energy reduction in the design of
SOCs. The case studies showed that a design can optimize the performance,
cost, and energy trade-off by adjusting the datapath width. Bit-width
analysis of variables provides very important information for
datapath-width adjustment.
Quality-driven Design will be a direction for future design methods in
the signal-processing domain, and datapath-width adjustment can be a
powerful technique in QDD.

REFERENCES

[1] Y. Cao and H. Yasuura, "A system-level energy minimization using datapath
optimization," in Proc. International Symposium on Low Power Electronics and Design,
August 2001.
[2] B. Shackleford, et al., "Memory-CPU size optimization for embedded system designs," in
Proc. of the 34th Design Automation Conference, June 1997.
[3] T. Ishihara and H. Yasuura, "Programmable power management architecture for power
reduction," IEICE Trans. on Electronics, vol. E81-C, no. 9, pp. 1473-1480, September
1998.
[4] H. Yamashita, H. Yasuura, F. N. Eko, and Y. Cao, "Variable size analysis and
validation of computation quality," in Proc. of the Workshop on High-Level Design
Validation and Test (HLDVT'00), Nov. 2000.
[5] M. Stephenson, J. Babb, and S. Amarasinghe, "Bitwidth analysis with application to
silicon compilation," in Proc. Conf. on Programming Language Design and
Implementation, June 2000.
[6] M.-A. Cantin and Y. Savaria, "An automatic word length determination method," in
Proc. of the IEEE International Symposium on Circuits and Systems, pp. V53-V56,
May 2001.
[7] S. Mahlke, R. Ravindran, M. Schlansker, R. Schreiber, and T. Sherwood, "Bitwidth
cognizant architecture synthesis of custom hardware accelerators," IEEE Trans. CAD,
vol. 20, no. 11, pp. 1355-1371, Nov. 2001.
[8] H. Yasuura, H. Tomiyama, A. Inoue, and F. N. Eko, "Embedded system design using a
soft-core processor and Valen-C," Journal of Information Science and Engineering,
vol. 14, pp. 587-603, Sept. 1998.
[9] F. N. Eko, et al., "Soft-core processor architecture for embedded system design," IEICE
Trans. Electronics, vol. E81-C, no. 9, pp. 1416-1423, Sep. 1998.
[10] A. Inoue, et al., "Language and compiler for optimizing datapath widths of embedded
systems," IEICE Trans. Fundamentals, vol. E81-A, no. 12, pp. 2595-2604, Dec. 1998.
[11] C. N. Taylor, S. Dey, and D. Panigrahi, "Energy/latency/image quality tradeoffs in
enabling mobile multimedia communication," in Software Radio: Technologies and
Services, E. Del Re, Ed., Springer-Verlag, January 2001.
[12] Y. Cao and H. Yasuura, "Video quality modeling for quality-driven design," in Proc. of
the 10th Workshop on Synthesis and System Integration of Mixed Technologies
(SASIMI 2001), Oct. 2001.
Chapter 8
Energy-Efficient Design of High-Speed Links

Gu-Yeon Wei¹, Mark Horowitz², Jaeha Kim²

¹Harvard University; ²Stanford University

Abstract: Techniques for reducing power consumption and bandwidth limitations of
inter-chip communication have been getting more attention to improve the
performance of modern digital systems. This chapter begins with a brief
overview of high-speed link design and describes some of the power vs.
performance trade-offs associated with various design choices. The chapter
then investigates various techniques that a designer may employ to reduce
power consumption. Three examples of link designs and link building blocks
found in the literature present energy-efficient implementations of these
techniques.

Key words: High-speed I/O, serial links, parallel links, phase-locked loop, delay-locked
loop, clock data recovery, low-power, energy-efficient, power-supply
regulator, voltage scaling, digital, mixed-signal, CMOS.

8.1 INTRODUCTION

Aggressive CMOS technology scaling has enabled explosive growth in the
integrated circuits (IC) industry with cheaper and higher-performance chips.
However, these advancements have led to chips being limited by the chip-to-
chip data communication bandwidth. This limitation has motivated research
in the area of high-speed links that interconnect chips [1][2][10][3][4] and
has enabled a significant increase in achievable inter-chip communication
bandwidths. Enabling higher I/O speed and more I/O channels improves
bandwidth, but this can also increase power consumption, which eats into the
overall power budget of the chip. Furthermore, complexity and area become
major design constraints when trying to potentially integrate hundreds of
links on a single chip. Therefore, there is a need for building energy-efficient
high-speed links with low design complexity.
Power in synchronous CMOS digital systems is dominated by dynamic
power dissipation, which is governed by the following well-known equation:

P_dyn = α · C · V_dd · V_swing · f

where α is the switching activity, C is the total switched capacitance,
V_dd is the supply voltage, V_swing is the internal swing magnitude of signals
(usually equal to V_dd for most CMOS gates), and f is the frequency of
operation. And since power is the rate of change of energy, the energy per
operation is

E_op = P_dyn / f = α · C · V_dd · V_swing

Power consumption in analog circuits is simply set by the static current
consumed, such that P_analog = I_static · V_dd. Technology scaling enables lower
power and energy in digital systems since the next generation process scales
both capacitance and voltage. Transistors also get faster; thus it is possible to
run a scaled chip at higher frequencies while still dissipating less power.
Aside from technology scaling, reducing just the supply voltage for a
given technology enables significant reduction in digital power and energy
consumption since both are proportional to the supply voltage squared.
However, voltage reduction comes at the expense of slower gate speeds, so
there is a trade-off between performance and energy consumption.
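The quadratic dependence of energy on supply voltage can be captured numerically; the sketch below is illustrative, with V_swing assumed equal to V_dd:

```c
#include <assert.h>

/* Dynamic power: P = a * C * Vdd * Vswing * f, with Vswing = Vdd. */
double dyn_power(double a, double c, double vdd, double f) {
    return a * c * vdd * vdd * f;
}

/* Energy per operation: E = P / f = a * C * Vdd^2. */
double energy_per_op(double a, double c, double vdd) {
    return a * c * vdd * vdd;
}
```

Halving the supply voltage, for instance, cuts the energy per operation to one quarter, which is why supply reduction is such a powerful knob despite its speed penalty.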
Recognizing this relationship between supply voltage and circuit
performance, dynamically adjusting the supply voltage to the minimum
needed to operate at a desired operating frequency enables one to reduce the
energy consumption down to the minimum required. This technique is
referred to as adaptive power-supply regulation and requires a mechanism
that tracks the worst-case delay path through the digital circuitry with respect
to process, temperature, and voltage in order to determine the minimum
supply voltage required for proper operation. Although it was first applied to
digital systems, adaptive supply regulation can also enable energy-efficient
high-speed link design. It is one of several energy-reduction techniques that
will be investigated in this chapter.
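A toy model of such a tracking loop is sketched below; the alpha-power delay law (with α = 2) and all constants are assumptions made for illustration, not from any cited design:

```c
#include <assert.h>

/* Replica critical-path delay vs. supply, alpha-power law (alpha = 2). */
double path_delay(double vdd, double vth, double k) {
    return k * vdd / ((vdd - vth) * (vdd - vth));
}

/* Adaptive supply regulation sketch: step the supply voltage down
   while the replica path still meets the target clock period. */
double adapt_supply(double period, double vth, double k) {
    double vdd = 3.3;                       /* start at the maximum supply */
    while (vdd - 0.01 > vth + 0.1 &&
           path_delay(vdd - 0.01, vth, k) <= period)
        vdd -= 0.01;                        /* lower supply in small steps */
    return vdd;
}
```

The returned supply is the lowest (to within the step size) at which the replica delay still meets the period, so the circuit burns no more energy than the target frequency requires.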
The design of energy-efficient links relies on optimizing all components
of the interface. This optimization requires an analysis of each component
comprising the link and making the right power/performance trade-offs. In
order to understand these trade-offs, Section 2 presents an overview of link
design. Then, Section 3 investigates several approaches used in digital
systems that can also be applied to build energy-efficient links. It begins
with concepts utilizing parallelism to reduce power consumption.
Subsequently, an adaptive supply-regulation technique is introduced that
offers a scheme for optimizing energy consumption in the overall link
architecture. Section 4 presents implementation details of various test chips
that employ many of the techniques described in Sections 2 and 3 to build
energy-efficient serial links and link building blocks.

8.2 OVERVIEW OF LINK DESIGN

High-speed links can provide high communication bandwidths between
chips and consist of four major components as shown in Figure 8.1. A
serializer converts parallel data bits into a high-speed serial bit stream that
sequentially feeds a transmitter. The transmitter then converts the digital
binary data into low-swing electrical signals that travel through the channel.
This channel is normally modeled as a transmission line and can consist of
traces on a printed circuit board (PCB), coaxial cables, shielded or
unshielded twisted pairs of wires, traces within chip packages, and the
connectors that join these various parts together. A receiver then converts the
incoming electrical signal back into digital data and relies on a timing-
recovery block to compensate for delay through the channel and accurately
receive the data. A de-serializer block converts the received serial bit stream
into parallel data and re-times the data to the clock domain of the rest of the
digital system that consumes it.
Links commonly used in modern digital and communication systems can
be categorized into two forms – parallel and serial links. High-speed serial
links are better suited for applications that are pin- and channel-limited such
as the backplane communication in router and switch boxes [5][6][7]. They
are also used as components in multi-gigabit optical links [8][9]. Serial links
tend to communicate over long distances and therefore emphasize
maximizing bits per second through a single channel with considerable effort
required to overcome non-ideal channel characteristics.
Other systems that require high-bandwidth communication between
chips, with less stringent restrictions on pin and channel resources, can
utilize several parallel sets of these data links. One implementation example
of this type of interface is called a source-synchronous parallel interface [3]
and is presented in Figure 8.2. It relies on a separate clock signal for
accurate timing recovery, which is shared by the parallel links, and requires
that delays through each channel match one another. This can be achieved
through careful matching of the channel lengths and augmented with delay-
compensation schemes to account for residual mismatches [11]. Since the
basic components comprising both parallel and serial links are the same, as
the chapter delves into design details and issues, no distinction will be made
between serial and parallel links unless discussed explicitly.
In order to understand link operation and investigate ways to reduce
energy consumption, this section begins with a review of the different
figures of merit that govern high-speed link performance. Then, the chapter
investigates how different design choices affect power and performance in
each of the components described in the following subsections. It is
important to note that the design choices to be made are governed by the specific
environment, technology, and system-level requirements of each design.
Therefore, trade-offs are presented so that the designer can make the
appropriate compromises to converge on an energy-efficient design.

8.2.1 Figures of Merit

The performance and reliability of high-speed links depend on several
figures of merit. Besides the raw number of bits transmitted per second, the
quality of the signal determines whether the receiver can accurately decipher
the waveform back into digital data. If the signal is indecipherable then the
bit rate is meaningless. Therefore, there needs to be a way of looking at the
received signal and determining its quality. One can look at an eye-diagram
using voltage and timing margins as quantitative measures of link quality,
which can be used as metrics for comparing performance trade-offs. Lastly,
bit-error rate is another figure of merit for a link’s ability to reliably transmit
and receive data.

Figure 8.3 presents eye-diagrams for ideal and real links, where the x-
axis spans two bit times in order to show both rising and falling transitions
of the data signal. For a random data sequence, there are both falling and
rising transitions at each bit interval. While the data levels and bit intervals
are clearly defined for the ideal case, real systems suffer from process
variability, environmental changes, and various noise sources that interact
with the signal to blur (or close) the eye. Notice that the high and low
voltage levels are no longer well-defined levels but occur over ranges. The
same holds true for the transition times. Qualitatively, larger eye openings
represent more reliable links. Quantitatively, one can apply two metrics to
measure its quality – voltage margin and timing margin. The vertical eye
opening, measured in the middle, determines how much voltage margin the
receiver has in determining whether the received signal is a high- or low-
level. The horizontal opening provides a measure of how well the receiver
can decipher one data bit from the next. Due to the finite slope of edge
transitions, reduction in voltage margin also leads to narrower timing
margins.
Besides environmental variation and noise in the transceiver circuits,
there are non-idealities in the channel that degrade signal quality. Therefore,
an eye-diagram at the receiver presents a more realistic picture of link
performance than one measured at the transmitter. Unfortunately, even
measuring at the receiver does not provide the whole picture. There can be
voltage and timing offsets in the receiver and the designer must subtract
these offsets from the measured margins. Furthermore, since the
measurement occurs over a finite time interval, it cannot fully capture the
effects of unbounded random noise sources (e.g., thermal noise, 1/f noise,
device noise, etc.) that are represented as probabilistic distributions with
infinite tails. So instead of relying only on margins, designers present link
reliability in terms of the bit-error rate (BER), the probability that a
received bit is in error. This probability is an exponential
function of the excess signal margins divided by the RMS distribution of the
random noise sources [12]. Increasing margins and reducing noise improves
BER but may come at the expense of higher power consumption. Therefore,
understanding and making the right trade-offs between performance and
power is important. Let us take a look at what some of these trade-offs are
by reviewing the operation of the link components, beginning with the
transmitter.

8.2.2 Transmitter

The transmitter converts binary data into electrical signals that propagate
through an impedance-controlled channel (or transmission line) to a receiver
at the opposite end. This conversion must be done with accurate signal levels
and timing for a reliable high-speed communication link. Link designers
commonly use high-impedance current-mode drivers in single-ended or
differential configurations, and there are various choices for terminating the
signals through the impedance-controlled channel. This subsection
investigates these different transmitter options and looks at how they impact
power/energy consumption. Lastly, controlling the slew rate of the
transmitted signal is desirable for minimizing noise coupling into the
channel. Since lower noise solutions enable lower power, this section
presents several techniques for slew-rate controlled transmitters. The
discussion will start with a single-ended high-impedance driver.

8.2.2.1 High-impedance Drivers

A high-impedance driver utilizes a current source switch operating in
saturation to push signals through a channel as shown in Figure 8.4.
Characteristics of the signal transmission depend on the choice of
termination used. The simplest scenario is to use a matched-impedance
terminator at either the transmitter or receiver side of the link. With
transmitter-only termination, a voltage divider is formed at the source and a
voltage waveform, with amplitude set by propagates down
the channel. Assuming a perfect open-circuit termination at the receiver,
with reflection coefficient the waveform amplitude doubles at the
receiver. Then the signal reflects back to the source, and its energy is
absorbed by the termination resistor at the transmitter, given that its
impedance matches that of the channel. Receiver-only termination behaves
similarly, except that a current is transmitted through the channel and the full
voltage amplitude, is seen at the receiver. In either case, the
same voltage amplitude of the signal is seen at the receiver. However,
utilizing single termination on either the transmitter or receiver side has
some disadvantages stemming from non-idealities in implementation.

Achieving perfect matched-impedance termination can be difficult due to
tolerances in the channel and components. There may also be discontinuities
in the channel due to package parasitics and connectors. These non-idealities
lead to amplitude noise as energy sloshes back and forth through the channel
arising from imperfect termination. Other discontinuities throughout the
channel exacerbate this situation. Consequently, designers use double
termination with matched-impedance termination resistors on both sides of
the link. In that case, the energy of the transmitted voltage waveform is
absorbed at the receiver, with half the amplitude of the single-terminated
case seen at the receiver. Although the swing amplitude is now smaller,
residual energy that sloshes back and forth due to impedance mismatches
attenuates twice as quickly, since energy is absorbed on both sides. Hence,
despite the smaller swing, the signal-to-noise ratio (SNR) can be higher
with double termination.
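The amplitude bookkeeping for these termination schemes can be checked with a short sketch. This is an illustrative first-order model, not from the text; the 50-Ω line impedance and 1-V drive are assumed values.

```python
# Illustrative model of launch amplitude and reflection for the
# termination cases discussed above.
def launch_amplitude(v_drive, z0, r_src):
    # Voltage divider between the source termination and the line.
    return v_drive * z0 / (z0 + r_src)

def reflection_coefficient(z_term, z0):
    if z_term == float("inf"):  # ideal open circuit
        return 1.0
    return (z_term - z0) / (z_term + z0)

z0, v = 50.0, 1.0  # assumed line impedance (ohms) and driven swing (volts)

# Transmitter-only termination: half swing launched, doubled at the open end.
launched = launch_amplitude(v, z0, r_src=z0)
at_rx_single = launched * (1 + reflection_coefficient(float("inf"), z0))

# Double termination: same launch, but the matched receiver absorbs it all,
# so the receiver sees half the single-terminated amplitude.
at_rx_double = launched * (1 + reflection_coefficient(z0, z0))
```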

8.2.2.2 Single-ended vs. Differential

So far, this chapter has looked at a single high-impedance driver that
transmits an analog waveform through the channel. In order to convert this
signal back into data bits, its voltage and timing characteristics must be
known. More specifically, one needs some voltage and timing references
with respect to which the signal can be deciphered as a logical "1" or "0"
and adjacent bits can be distinguished (timing issues will be discussed in Section 8.4).
In single-ended links, unless fixed transmitted voltage levels with a common
reference such as Vdd or Ground are known and shared by both sides of the
link, an additional voltage reference is required. This additional reference,
set to half the transmitted signal amplitude, can be transmitted along with the
208 Energy-Efficient Design of High-Speed Links

data, and the ability to vary the transmitted level enables lower power
dissipation. In the case of parallel links, several channels may share a single
reference line and overhead of the reference line can be amortized across
them all. For serial links, a reference voltage line may also be used, but
designers will more commonly use a differential signaling scheme where a
pair of wires carries complementary signals. Two implementations are
illustrated in Figure 8.5. One uses a differential pair with a single current
source that sets the output swing. The other implements a pair of single-
ended transmitters, each transmitting complementary data. The drawback of
using a differential pair arises from the reduced gate overdrive on the output
devices. Using larger devices can enable the same current drive at the
expense of larger capacitive loading on both the inputs and outputs that can
limit bandwidth and increase power.

A differential transmitter has several nice properties. The current
consumption of the link is constant and does not induce voltage spikes in the
power supply lines arising from parasitic inductance in the packaging. Tight
coupling of the lines enables low electro-magnetic interference (EMI) since
the return currents for the signals are through the adjacent wires. Lastly,
differential signals present larger effective signal amplitudes to the receiver
compared to a single-ended signal to facilitate the conversion of the signal
energy into digital data bits. However, these benefits come at the cost of greater pin
resources. Thus, differential signaling is common in serial link designs, but
parallel links often require single-ended interfaces to reduce pin count.
Although differential signaling may appear to require higher power
dissipation since there are now two channels that switch, this is not always
the case. There have been recent investigations that compare single-ended
and differential signaling that show that lower signal-to-noise ratios are
achievable with differential signaling leading to lower transmitted swing
levels [13]. Line power is a function of the transmitted voltage swing, as
shown by the following equation:

P_line = α · Vdd · Vswing / Z0

where Vdd is the supply voltage, Z0 is the transmission line and termination
impedance, Vswing is the transmitted voltage swing, and α is the signal's
activity factor. Therefore, lower power links
are possible with differential signaling.
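The trade-off can be illustrated numerically. In this sketch, the specific form P = α·Vdd·Vswing/Z0 (a current-mode driver into a matched line) and the assumption that differential signaling can tolerate half the single-ended swing are both illustrative, not taken from the cited measurements.

```python
# Sketch: line power for a driver into a terminated line, assuming
# P = activity * Vdd * Vswing / Z0 (functional form assumed here).
def line_power(v_dd, v_swing, z0, activity=1.0):
    return activity * v_dd * v_swing / z0

vdd, z0 = 1.8, 50.0  # assumed supply and line impedance

p_single = line_power(vdd, v_swing=0.50, z0=z0)    # one wire, full swing
p_diff = 2 * line_power(vdd, v_swing=0.25, z0=z0)  # two wires, half swing
```

With these assumed numbers the two schemes burn the same line power; if the noise immunity of differential signaling permits a swing reduction of more than 2x, the differential link consumes less.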

8.2.2.3 Slew-rate Control

So far, it has been seen that reducing noise can lead to lower power link
designs. Package and connector non-idealities can be another source of
noise. High-frequency energy in the transmitted signal can interact with
parasitic RLC tanks to cause ringing in the line and coupling (cross talk) into
adjacent lines. Therefore, high-speed link designs often limit the edge rate of
transmitted signals to mitigate these effects. Implementing edge-rate control
is fairly straightforward and several examples can be found in the literature.
There are two general approaches used to implement edge-rate control. The
technique illustrated in Figure 8.6(a) limits the slew rate of signals by
controlling the RC time constant of the driver’s input signal [14]. This can
be achieved by adjusting the capacitive loading or by changing the drive
strength of the preceding predriver buffer and thereby varying its effective
output resistance. In so doing, the edge-rate of the signal also slews
accordingly at a controlled rate. Another technique, presented in Figure
8.6(b), breaks the driver input into smaller parallel segments and slews the
output by driving the segments in succession with some delay (often
implemented with an RC delay line) [15]. Care must be taken to guarantee
that the time constants of the signal slew are fixed in proportion to the
symbol rate. Since both the RC of the predriver and the delay of the delay elements
are dependent on process and operating environments, some mechanism for
controlling them is required. Time constants can be controlled manually or
with a simple control loop that relies on a process and environment
monitoring circuit. An inverter-based ring oscillator is a good example of
such a circuit [14]. The oscillation period of the ring is directly related to
process and environmental conditions. Therefore, by counting the
oscillations over a known period, a digital control loop can converge to the
appropriate slew-rate settings for the symbol rate. A system-level approach
to this basic concept that utilizes knowledge of the process and
environmental conditions of a chip can be extended to other parts of the link
interface to enable energy-efficient designs [13][16] and are discussed in
more detail in Sections 3 and 4.
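The counting loop described above might be sketched as follows. The linear oscillator model, the window length, and the count-to-code mapping are all illustrative assumptions, not from the cited designs.

```python
# Hypothetical slew-rate calibration: count ring-oscillator edges over a
# fixed reference window to sense the process/environment corner, then
# map the count to a slew-rate code. A fast corner gets a higher code
# (weaker predriver, slower edge); a slow corner gets a lower one.
def ring_count(corner_speed, window_cycles=1024):
    # Edges counted in the window; a faster corner oscillates more times.
    return int(window_cycles * corner_speed)

def slew_code(count, nominal_count=1024, num_codes=8):
    code = round(num_codes * count / (2 * nominal_count))
    return max(0, min(num_codes - 1, code))  # clamp to the available codes

typ = slew_code(ring_count(1.0))   # typical corner -> mid-range code
fast = slew_code(ring_count(1.4))  # fast corner -> higher code
slow = slew_code(ring_count(0.7))  # slow corner -> lower code
```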

8.2.3 Receiver

At the opposite end of the channel, a receiver circuit deciphers the incoming
analog signals into digital data bits. This block commonly consists of a
differential sampling circuit that samples the data in the middle of the
received symbol and amplifies the low-swing signal to binary levels. Single-
ended signaling connects the signal line to one input of the differential pair
while the other is set to a reference voltage to which the signal is compared.
Differential signaling connects each signal line to each side of the input
buffer. So, the effective voltage swing seen by the receiver is much greater
for differential signaling than single-ended signaling for the same swing
magnitudes. This effect enables differential signaling to require smaller
voltage swings, which can lead to lower power consumption.

While direct sampling of the incoming signal enables a simple design,
link designs often add a preconditioning stage before the sampler [17].
Preconditioning has several advantages: it enables higher common-mode
rejection to relax the input dynamic-range requirements of the sampler; it
isolates the sampler from injecting noise back into the channel; and it offers
a way to filter the incoming signal. There are a few ways to implement this
preconditioning. One commonly-used technique converts the voltage into a
current and integrates charge over the bit time to convert the current back
into a voltage signal that can be sampled and is called a current-integrating
receiver [18]. The integration has several desirable properties when
receiving high-speed signals. Cross talk resulting from coupling can corrupt
signals. If a noise event occurs right at the sampling point of the received
signal, it can significantly degrade voltage margins and make the signal
indistinguishable. To avoid this problem, an integrating receiver does not
only look at the data at one moment in time but over the entire bit time.
Figure 8.7 (a) illustrates its implementation. The input voltage waveform
steers current through the differential pair from the integrating capacitors
and a sample-and-hold circuit delivers the measured voltage difference to a
sampler that amplifies the signal to digital logic levels. Integration mitigates
the effects of high-frequency noise. An alternative way to think about this is
that the integration implements a filter with its bandwidth equivalent to the
symbol rate. Hence, coupling noise events, which are normally high-
frequency disruptions, are filtered out. The noise rejection capabilities of the
integrating receiver can effectively lead to a more energy-efficient design
since larger swings are not necessary to overcome noise from cross talk.
Rather, the minimum signal-swing magnitudes required for this type of
receiver may depend on swing levels necessary to overcome offsets in the
differential pair and sampler. Minimizing offsets in the receiver via
calibration can lead to robust link designs with extremely low swing levels
[19].
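The robustness argument can be made concrete with a toy model. The oversampled waveform values and the crosstalk glitch below are invented for illustration only.

```python
# Toy comparison of direct sampling vs. integrating over the bit time.
def decide_by_sample(samples, sample_index):
    return 1 if samples[sample_index] > 0 else 0

def decide_by_integration(samples):
    # Integrate (sum) across the whole bit time, then decide.
    return 1 if sum(samples) > 0 else 0

# A '1' bit seen as 8 oversamples of +50 mV, with an illustrative -200 mV
# crosstalk glitch landing exactly on the nominal sampling point.
bit = [0.05] * 8
bit[4] = -0.20

direct = decide_by_sample(bit, 4)        # the glitch flips the decision
integrated = decide_by_integration(bit)  # the integral still resolves '1'
```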
A similar preconditioning scheme relies on an amplifier to buffer the
incoming signal from the sampler. In order to achieve noise rejection characteristics
similar to those of the integrating receiver, the amplifier should have a
limited bandwidth set to no greater than the symbol rate of the incoming
signal. The bandwidth of this amplifier must not only track the incoming
symbol rate but also do so in the presence of process and environmental
variations. Figure 8.7 (b) presents a schematic of this type of receiver where
the bandwidth of the front-end amplifier is set by its output RC time constant
[13]. If the load’s impedance can track bit rate, process, and operating
conditions, the bandwidth can be set to reject high-frequency noise and only
allow energy up to the symbol rate to pass through, like the integrating receiver.
8.2.4 Clock Synthesis and Timing Recovery

Both the transmission and reception of data symbols in high-speed links
must operate in lock step with respect to an accurate timing reference.
Deviations from ideal timing reference points can lead to errors in
communication, and, therefore, timing jitter and offsets must be limited. As a
review, this section presents approaches commonly used to generate clock
signals for the transmitter and receiver. Both blocks normally rely on a
phase- or delay-locked loop (PLL or DLL) to generate on-chip clock signals
that are locked with respect to an external reference. The selection of
utilizing either a PLL or DLL depends on the system-level environment due
to their respective advantages and disadvantages for generating low-jitter on-
chip clock signals. While the loops share several common building blocks,
their operation differs as a function of their configuration. A PLL must
integrate frequency in order to achieve lock while a DLL simply adjusts
delay [20][21].
These differing configurations lead to different input-to-output phase
transfer functions for each loop. The closed-loop phase transfer function of a
PLL exhibits a low-pass filter characteristic. Hence, it has the property of
rejecting high frequency noise from the input while tracking noise within the
bandwidth of the loop. The VCO frequency is driven as a function of the
phase error, but there is no direct signal path between the input clock source
and the on-chip clock. However, in the case of a DLL, the on-chip clock is
directly connected through delay elements to the input clock source, and the
input-to-output phase transfer function is effectively an all-pass filter. This
apparent drawback limits the appeal of using DLLs in systems that suffer
from a noisy input clock source. However, a DLL has advantages over a
PLL when internal noise sources (e.g., power supply noise) dominate. Since
a PLL relies on an oscillator, if a noise event perturbs an edge in the VCO,
the oscillator will recirculate the noise until the loop can compensate for it at
a rate set by the bandwidth of the loop. Therefore, wide bandwidth is
desirable to quickly recover from jitter due to on-chip noise [22]. On the
other hand, a DLL does not accumulate jitter over multiple clock cycles
since the delay line is reset every cycle. Hence, lower jitter may be possible
with a DLL when on-chip noise sources are the dominant cause of jitter.
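A rough discrete-time sketch illustrates this jitter-accumulation difference. The first-order per-cycle correction model and all numbers are illustrative assumptions.

```python
# A VCO edge disturbance recirculates in a PLL and bleeds away at a rate
# set by the loop bandwidth; a DLL's delay line is flushed by the clean
# reference every cycle, so the error leaves after one cycle.
def pll_residual(initial, loop_gain, cycles):
    # Assume the loop removes a fraction `loop_gain` of the phase error
    # per recirculation (0 < loop_gain < 1).
    trace = [initial]
    for _ in range(cycles):
        trace.append(trace[-1] * (1 - loop_gain))
    return trace

def dll_residual(initial, cycles):
    return [initial] + [0.0] * cycles

pll = pll_residual(10.0, loop_gain=0.1, cycles=20)  # ps of phase error
dll = dll_residual(10.0, cycles=20)
# The PLL still carries residual jitter many cycles after the event;
# the DLL's error vanishes after one cycle.
```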
Transmitters can utilize either a PLL or DLL to generate an on-chip clock
signal with respect to which data symbols are driven onto the channel [23].
High performance links often operate at bit rates higher than the block that
supplies the data. Therefore, the clock generator also serves to align and
serialize the parallel data. This often requires clock frequency multiplication
to generate a higher clock rate for data transmission with respect to the lower
clock rate at which the parallel data feeds the transmitter. In order to
minimize timing uncertainty, the clock signal (or edge) ought to be
combined with the data at the latest possible point in the transmitter
datapath. However, drivers can be large in order to drive long distances and
present an appreciable capacitive load to the clock generator. Since the
activity factor of a clock signal is higher than the data, combining the data
with the clock signals before the ramp-up buffer chain can trade timing
uncertainty for energy efficiency. Other clocking strategies to enable energy
efficiency will be presented in Section 3.
The receiver also relies on a PLL or DLL to align the on-chip clock
signals with respect to the incoming data symbols in order to accurately
sample the data and differentiate successive bits from one another. The
specific implementation of the timing recovery circuit depends on the
architecture of the link. For source-synchronous parallel links, where a clock
signal is transmitted in parallel with the data, the clock-recovery loop locks
to the received clock signal and is used to sample the data signals. When no
explicit clock signal is provided, the timing-recovery block must extract
timing information directly from the data stream utilizing a phase-detecting
block.

In either configuration, a robust example of clock recovery utilizes a
dual-loop architecture introduced by Sidiropoulos, et al. in [24] and
illustrated in Figure 8.8. It relies on a core loop that generates coarsely
spaced clock edges that evenly span a clock period. These clock edges can
be generated with either a PLL or DLL. Then, a secondary loop utilizes an
interpolator to generate a finely spaced clock edge aligned to the incoming
data symbols. A phase-detecting block drives control circuitry that generates
a control word to select an adjacent pair of clock edges from the core loop
and appropriately weight the contribution of each edge in order to slide the
resulting on-chip clock edge into lock. This dual-loop scheme not only offers
infinite capture range, which is a limitation for conventional DLLs, but with
a sufficiently high slewing capability it can accommodate small frequency
differences between the core loop’s clock rate and the data rate of the
received signal. This ability to compensate for frequency differences is
important for high-speed links because the opposite ends of a transceiver
may not share a common clock source.
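The interpolation step can be sketched as a weighted blend of two adjacent core-loop phases. Linear mixing is an idealization of a real interpolator, and the 8-phase, 1-ns period is an assumed example.

```python
# Phase interpolation between adjacent coarse edges from the core loop.
def coarse_phases(n, period):
    # n evenly spaced edges spanning one clock period.
    return [i * period / n for i in range(n)]

def interpolate(phases, sel, weight, period):
    """Blend phases[sel] with the next edge; weight in [0, 1]."""
    a = phases[sel]
    b = phases[(sel + 1) % len(phases)]
    if b <= a:          # wrapped past the end of the period
        b += period
    return (1 - weight) * a + weight * b

edges = coarse_phases(8, period=1.0)  # 8 coarse edges, 125 ps apart at 1 ns
mid = interpolate(edges, sel=2, weight=0.5, period=1.0)
wrap = interpolate(edges, sel=7, weight=0.5, period=1.0)  # across the wrap
```

Sweeping `sel` around the ring while adjusting `weight` is what gives the loop its unlimited (infinite) capture range.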
Although clock generation for the transmitter and receiver was
introduced separately, since the transmitter and receiver for different
channels reside on the same die, they may share some of the clock
generating components. More specifically, the core loop described for timing
recovery of a receiver may also serve as the clock generator for an adjacent
transmitter [25]. Such sharing of components not only reduces circuit
redundancy, but it obviates issues arising from having multiple loops on the
same substrate4. Moreover, on-chip clock generation and distribution is a
significant source of power consumption in high-speed links and efforts to
reduce this power can enable a much more energy-efficient design.

8.2.5 Putting It Together

This section has provided a brief overview of high-speed link design.
Several approaches for implementing each of the components are possible,
but the designer must first understand the system-level noise characteristics
in order to converge on the most efficient design. In both the transmitter and
receiver, a lower noise solution leads to lower energy since extra voltage
margins can be avoided. In clock generation and timing recovery, precise
placement of clock edges not only enables higher performance, but may also
enable some trade-offs between timing margin and energy.
As modern high-speed links strive for bit rates on the order of multiple
Giga-bits per second or higher, intrinsic losses in the channel due to
dielectric and skin loss can significantly degrade performance. The channel
looks like a low-pass filter at frequencies greater than 1 GHz for traces
on a printed circuit board [26]. This frequency-dependent attenuation leads
to inter-symbol interference, which can severely close the received eye. This
is not a new problem but one that copper-based wire-line communication links
(e.g., DSL, Ethernet) have been contending with for some time. A common
solution is to use equalization schemes to compensate for the low-pass
characteristics of the channel. Several high-speed link designs also employ a
type of equalization at the transmitter called pre-emphasis [27][28][29],

4 When multiple PLLs are integrated onto the same substrate, they may suffer from
injection locking if not isolated from one another and can be a significant source of clock
jitter [56].
where the transmitter pre-distorts the signal in anticipation of the filtering
caused by the channel. While equalization or pre-emphasis enables links to
achieve higher bandwidths, it can be fairly complex and costly in terms of
power.
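As a minimal illustration of transmit pre-emphasis, a 2-tap FIR pre-distortion can be sketched as follows. The tap weight of 0.25 is an arbitrary assumed value, not taken from the cited designs.

```python
# 2-tap FIR pre-emphasis: y[n] = x[n] - tap * x[n-1]. The first bit of a
# run gets a boosted amplitude while repeated bits are attenuated,
# pre-distorting the signal against the channel's low-pass rolloff.
def pre_emphasize(bits, tap=0.25):
    out, prev = [], 0.0
    for b in bits:
        level = 1.0 if b else -1.0
        out.append(level - tap * prev)
        prev = level
    return out

tx = pre_emphasize([0, 1, 1, 1, 0])
# Transitions are emphasized (magnitude 1.25); later repeats are
# de-emphasized (0.75), shifting transmitted energy toward the high
# frequencies the channel attenuates most.
```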
In order to see what other techniques are available for reducing power
consumption in high-speed links, the next section describes several approaches for
enabling energy efficiency by exploiting parallelism and an adaptive supply-
regulation technique.

8.3 APPROACHES FOR ENERGY EFFICIENCY

Now that we have an understanding of how some of the different design
choices affect the energy efficiency of high-speed link designs, this section
further investigates approaches specifically targeted to improve energy
efficiency. Energy consumption has been a growing concern in building
large digital systems (e.g., microprocessors) and has led to several
advancements to reduce power consumption [30][31][32]. Since high-speed
links are by nature mixed-signal designs (consisting of both digital and
analog circuits), one can leverage many of the observations and techniques
applied to digital systems to build energy-efficient links. One approach can
be as simple as taking advantage of the next generation process technology
to enable lower energy consumption for the same performance. Parallelism
is another technique that digital designers have used to reduce power without
sacrificing performance. This section looks at several forms of parallelism
that are also possible in link design. Lastly, adaptive power-supply
regulation, a technique that has enabled energy-efficient digital systems, is
introduced and its application to the design of high-speed links is presented.

8.3.1 Parallelism

Parallelism has often been used in large digital systems as a way to achieve
higher performance while consuming less power at the expense of larger
area. Breaking up a complex serial task into simpler parallel tasks enables
faster and/or lower power operation in the parallel tasks. For links, the goal
is to reduce power consumption in the overall design without sacrificing bit
rate. An obvious way to parallelize an interface is to utilize multiple links to
achieve the desired aggregate data throughput (i.e., parallel links). Parallel
links can operate at lower bit rates in order to mitigate channel non-idealities
(e.g., skin and dielectric loss, and cross talk) and enable an energy-efficient
interface. However, this pin-level parallelism comes at the expense of pin
and channel resources, which are not always abundant in many
communication systems. Parallelism can also be applied to individual links
via two parameters – time and voltage. Examples of parallelism in time are
prevalent in modern link designs with double data-rate (DDR) and quad
data-rate (QDR) memory interfaces being the most visible [33]. Parallelism
in voltage can also be seen in many communication links from several
generations of Ethernet links to proprietary backplane transceiver designs.
Both of these forms of parallelism offer higher performance and/or power
savings by reducing the internal clock rate within the transceiver relative to
the bit rate of the link. This section further investigates both of these forms
of parallelism in detail.

8.3.1.1 Sub-clock Period Symbols

The clock rate of a chip limits link performance when the bit rate is equal to
the clock frequency. Even with aggressive pipelining to reduce the critical
path delay in the datapath, there is a minimum clock cycle time required to
distribute and drive the clock signal across the chip. As seen in Figure 8.9,
as the clock cycle time shrinks (expressed in terms of fanout-of-4 (FO4)
inverter delays5 on the x-axis), the clock signal experiences amplitude
attenuation as it propagates through a chain of inverters [34]. The minimum cycle time that
can be propagated is roughly 6 inverter delays. Transmitting at this clock
rate limits the bit rate to less than 1-Gb/s in a technology. However,
higher bit rates are desirable in high-speed links, and, therefore, transmitting
several bits within a clock cycle is required for higher data rates.
5 A fanout-of-4 inverter delay is the delay of an inverter driving a load equivalent to four
times its own input capacitance. A fanout of 4 is used since that is the optimal fanout for
implementing a ramp-up buffer chain to drive a large capacitive load with minimum delay.
Transmitting multiple bits within a clock period is not only a way to
improve performance, but it also offers a way to reduce power consumption
in the interface. Multiple clock phases can be generated using a ring
oscillator or delay line and driven to the transmitter. Combining the finely-
spaced clock edges with data can delineate shorter symbol intervals. A
simple analysis of the power consumed by the clocks for such a scheme with
N bits transmitted per clock period shows that to first order, the power
consumption is the same in each case as demonstrated by the following
equation of the total power:

P_total = N · K · C_load · V² · f_clk = K · C_load · V² · f_bit    (since f_clk = f_bit / N)

where N is the number of bits transmitted per clock period, K is a scaling
factor to account for the clock distribution, C_load is the effective capacitive
load of the transmitter, V is the supply voltage, and f_clk is the clock
frequency. In the expression, the Ns cancel, and so the total power remains
unchanged. However, the above scenario assumes that the voltage remains
the same for each case. For a lower clock rate, the inverters in the clock
distribution network do not need to operate as quickly and hence can operate
off of a lower supply voltage. Reducing voltage offers significant energy
savings since energy is a function of V². Furthermore, the multiple clock
phases required to enable sub-clock period symbols may be generated
locally and therefore avoid the power required to route them from the clock
generator to the transmitter. Examples of high-speed link designs that
leverage these power-saving concepts are presented in detail in Section 4.
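The cancellation, and where the real savings come from, can be checked numerically. The bit rate, load, supply, and the 20% supply reduction at the lower clock rate are all illustrative assumptions.

```python
# Clock power with N bits per clock period: N phases toggle, but each at
# f_bit / N, so at a fixed supply the product is unchanged. Savings
# appear only when the slower clock permits a lower supply voltage.
def clock_power(n_bits, f_bit, c_load, v, k=1.0):
    f_clk = f_bit / n_bits
    return n_bits * k * c_load * v ** 2 * f_clk

f_bit, c, v = 4e9, 1e-12, 1.8  # assumed bit rate, load, and supply

p_serial = clock_power(1, f_bit, c, v)
p_parallel = clock_power(4, f_bit, c, v)      # same power: the Ns cancel
p_scaled = clock_power(4, f_bit, c, 0.8 * v)  # slower clock -> lower V
```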
One caveat of utilizing multiple clock phases stems from phase offsets
that can eat into the timing margin of the link. Even in a low-noise
environment, process variations can cause skews in a multi-phase clock
generator as each delay element experiences device mismatches, resulting in
variations in transmitted symbol times. In comparison, the overall clock
period is immune to these offsets since each period is a combination of the
same mismatched-circuit delays. The resulting jitter seen by the receiver
occurs at a relatively high frequency such that the timing-recovery block
would not be able to track it. Fortunately, these offsets are static and can be
tuned out with additional offset-correction circuitry [35][36]. However, this
additional circuitry comes at the expense of higher complexity and power
consumption. Moreover, there is a limit to the amount of parallelism possible
that is set by the bandwidth of the transmitter and receiver circuits and the
non-idealities of the channel that plague high bit-rate links. The designer
must trade off the matching properties of the delay elements and clock
distribution circuits against the power and performance targets sought.

8.3.1.2 Pulse-amplitude Modulation

One can also break up the signal voltage swing into smaller segments to
encode multiple bits of data in one transmitted symbol. Pulse-Amplitude
Modulation (PAM) is a technique that enables higher bit rates without the
need for higher clock rates and has been demonstrated in several high-speed
link designs [28][7]. It relies on parallel transmitters to drive the channel by
encoding multiple bits into different voltage levels within a symbol as shown
by an example of a PAM-4 implementation in Figure 8.10. One of the
advantages of PAM is that the energy of symbols transmitted down the
channel is over a lower frequency spectrum than binary transmission at the
same bit rate. Hence, it experiences less distortion and loss through the
channel. Unfortunately, encoding bits into multiple amplitude levels reduces
voltage margins, and, therefore, this scheme is more susceptible to cross talk
[37].
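A minimal PAM-4 mapping illustrates the idea. The Gray assignment of bit pairs to levels is an assumption made here (it confines a one-level error to a single bit error), not necessarily what the cited designs use.

```python
# PAM-4: two bits per symbol, four amplitude levels, halving the symbol
# rate relative to binary signaling at the same bit rate.
LEVELS = {(0, 0): -3, (0, 1): -1, (1, 1): +1, (1, 0): +3}  # Gray-coded

def pam4_encode(bits):
    assert len(bits) % 2 == 0
    return [LEVELS[(bits[i], bits[i + 1])] for i in range(0, len(bits), 2)]

symbols = pam4_encode([0, 0, 1, 1, 1, 0, 0, 1])  # 8 bits -> 4 symbols
# Adjacent levels are 2 units apart versus 6 for binary at the same peak
# swing, which is the reduced voltage margin noted above.
```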
The approaches for enabling more energy-efficient link designs
investigated so far have relied on the ability to reduce clock rates in order to
reduce power consumption without sacrificing bit rate. They all can leverage
energy's quadratic dependence on supply voltage and trade circuit speed for lower energy
consumption. A dynamic voltage-scaling technique called adaptive power-
supply regulation extends this idea to maximize energy efficiency by
adjusting the supply voltage with respect not only to speed but also to
process and environmental conditions. It is described next.

8.3.2 Adaptive Power-Supply Regulation

The pursuit of reducing energy consumption in large digital systems has led
to a technique called adaptive power-supply regulation or dynamic voltage-
scaling, which maximizes energy efficiency in digital circuits by reducing
the supply voltage down to the minimum required for proper operation
[39][38][40]. By tracking process and environmental conditions, this
technique obviates the need for large timing margins normally required in
conventional designs to accommodate process and temperature variations
within and across chips. This section will focus on the general approach for
adaptively regulating the power supply of digital systems and how it enables
energy-efficient operation. Section 4 then extends its application to high-
speed link design through detailed examples.

The advantages of adaptively regulating the supply voltage for energy
savings are best demonstrated by looking at how the delay of an inverter
changes with supply voltage and then understanding its implications on
energy. The delay of digital CMOS circuits depends on three main
parameters – process, temperature, and supply voltage. Variability in
manufacturing results in chips that exhibit a range of performance due to
variations in device thresholds, oxide thickness, doping profiles, etc.
Operating conditions also affect performance. Temperature affects the
mobility of holes and electrons as well as the transistor's threshold voltage.
Lastly, circuit delay strongly depends on supply voltage. Delay variation of a
typical fanout-of-4 (FO4) inverter6 versus supply voltage in a typical

6 A fanout-of-4 inverter is an inverter that drives another inverter with four times its own
input capacitance.
CMOS process is shown in Figure 8.11. Assuming that the critical path
delay of a digital system is a function of some number of inverter delays
[40], the normalized frequency of operation versus supply voltage can be
found by inverting and normalizing the inverter’s delay and is also presented
in Figure 8.11. The frequency of operation achievable by a chip is roughly
linear in the supply voltage.

To understand what this relationship means for power, this delay data can
be applied to the dynamic power equation (equation 8.1), and the resulting
normalized power is plotted relative to normalized frequency for two supply
voltage configurations in Figure 8.12. Given a fixed supply voltage, power
consumption is proportional to frequency, resulting in a straight line in this
figure. Reducing frequency lowers power consumption. Moreover, since
gate delay can be allowed to increase when the required operating frequency is reduced, the
circuit can operate at lower supply voltages when operating at lower
frequencies. Hence, by reducing both frequency and supply voltage, power
consumption reduces dramatically, proportional to frequency cubed.
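The cubic relationship can be verified with normalized numbers, assuming the roughly linear frequency-versus-voltage relation described above (the linear fit itself is an idealization).

```python
# Normalized dynamic power P = C * V^2 * f. If the minimum workable
# supply scales roughly linearly with frequency, power falls as f^3.
def dynamic_power(v, f, c=1.0):
    return c * v ** 2 * f

def min_supply(f_norm):
    return f_norm  # assumed linear fit, normalized to the full-speed point

p_full = dynamic_power(min_supply(1.0), 1.0)
p_half_fixed = dynamic_power(1.0, 0.5)               # f only: 2x savings
p_half_scaled = dynamic_power(min_supply(0.5), 0.5)  # V and f: 8x savings
```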
In addition to the energy savings possible by adaptively regulating the
power supply down to lower levels for lower frequencies, there is a potential
for saving energy due to inefficiencies found in conventional designs that
operate off of a fixed supply voltage. Variability in circuit performance due
to process and temperature variations requires conventional designs to
incorporate overhead voltage margins to guarantee proper operation under
worst-case conditions. This is due to the circuit delay’s strong dependence
on process parameters and temperature. This overhead translates into excess
power dissipated to allow margins for worst-case corners. Although the IC
industry deals with process variability by speed binning, especially for
commodity parts such as semiconductor memories and microprocessors,
operating temperature generally cannot be known a priori, and, therefore,
chips still need margins to meet specifications over a wide range of
temperatures. By actively tracking on-die environmental conditions,
dynamic supply-voltage regulation can accommodate the performance
differences imposed by temperature variations to minimize energy
consumption.
For this technique to work, active tracking of how on-die environmental
conditions affect circuit performance (more specifically, the critical path
delay) is required. In high-speed links, the minimum clock period required
for clock distribution often sets the critical path. Therefore, a chain of
inverters can be used to model the critical path consisting of inverters in the
clock distribution network. Given this model of the critical path, adaptive
power supply regulation needs to generate the minimum supply voltage
required for proper operation at the desired frequency and efficiently
distribute it. This task requires two components: an efficient power-supply
regulator and a control mechanism to generate the correct voltage. Although
a linear regulator can be used to supply power as demonstrated in [39], the
power that the regulator itself consumes can be substantial and, therefore,
counteracts the power savings of this approach. Instead, a switching
regulator that has much higher conversion efficiency is preferred. Several
implementations of digitally controlled switching regulators can be found in
the literature. In each implementation, a feedback loop utilizes a model of
the critical path to find the minimum voltage required for the desired
frequency of operation. Feedback control loops that rely on simple digital
integrators or look-up tables to set the appropriate voltage with respect to
predefined performance targets in the form of a digital word or frequency
have been demonstrated in [41][42][43][44][45][40]. A more sophisticated
implementation utilizing sliding-window control is also possible [46]. Most
of these implementations have been applied to complex digital systems, such
as general-purpose microprocessor and DSP cores, with conversion
efficiencies close to or greater than 90%. They offer an energy-efficient
mechanism for adaptively regulating the supply voltage, which can be
applied to a parallel I/O subsystem that resides within a larger digital chip or
to a stand-alone high-speed serial link.
Since a high-speed link is inherently a mixed-signal design consisting of
both digital and analog components, there is a potential to leverage this
supply-regulation technique to conserve energy in the digital portions of the
chip. While the application is obvious for the clock distribution and datapath
blocks (serializer and de-serializer) that surround the transceiver core,
222 Energy-Efficient Design of High-Speed Links

dynamically scaling the supply also offers several properties that enable the
designer to replace several precision analog circuit blocks with digital gates.
This is especially appealing for future process technologies that aggressively
scale both voltage and feature size. Section 4.2 describes a serial link design
that adaptively regulates its supply voltage to enable energy-efficient
operation.

8.3.3 Putting It Together

This section investigated several techniques, commonly found in digital systems, that can be applied to high-speed link designs to improve energy efficiency. Parallelism in both time and voltage reduces the clock rates within the link interface circuitry, and this clock-rate reduction leads to lower power consumption, though it comes at the expense of reduced timing and voltage margins. One can extend this trade-off by reducing the supply voltage when operating at lower bit rates in order to maximize energy efficiency; in short, performance can be traded for energy. The next section investigates several examples that leverage many of the techniques and trade-offs described thus far in this chapter to build energy-efficient links.
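As a rough illustration of this trade-off, the following sketch compares the dynamic power of one full-rate lane against two half-rate lanes whose supply is scaled down. The models P = C·V²·f and delay ∝ 1/V, and all of the numbers, are illustrative assumptions rather than figures from this chapter; real circuits follow an alpha-power law and stop scaling near threshold.

```python
# Rough sketch of the parallelism/voltage-scaling trade-off. Assumes the
# simplified models P = C*V^2*f and gate delay ~ 1/V. All values are
# illustrative assumptions, not figures from this chapter.

def power(c_eff, vdd, f_clk):
    """Dynamic switching power P = C*V^2*f."""
    return c_eff * vdd ** 2 * f_clk

BIT_RATE = 4e9      # aggregate bit rate (b/s), assumed
C_LANE = 1e-12      # switched capacitance of one lane (F), assumed
V_NOM = 1.8         # nominal supply (V), assumed

# Serial reference: one lane toggling at the full bit rate.
p_serial = power(C_LANE, V_NOM, BIT_RATE)

# 2-way parallel: each lane runs at half rate; with delay ~ 1/V the supply
# can be halved while still meeting the relaxed timing budget.
p_parallel = power(2 * C_LANE, V_NOM / 2, BIT_RATE / 2)

print(p_parallel / p_serial)    # ~0.25: same bit rate at roughly 1/4 the power
```

The factor of four comes entirely from the V² term: halving the clock rate alone saves nothing per bit, but the voltage headroom it frees does.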

8.4 EXAMPLES

Several examples of low-power, energy-efficient link designs and link building blocks can be found in the literature. Since it would be impractical to investigate all of them, this section focuses on three examples. They all share a common theme of utilizing adjustable supply regulation applied to some, if not all, of their link components in order to reduce power consumption and enable energy-efficient operation.
and enable energy-efficient operation. Clock generation can be a significant
component of overall power consumption in links, and so this section begins
with an example of utilizing supply-regulated inverters as delay elements in
DLLs and PLLs. We will focus on the implementations found in [47], which
have been used in several link designs to reduce power consumption. The
next example looks at a serial link interface that utilizes adaptive power-
supply regulation to enable energy-efficient operation across a wide range of
frequencies and corresponding supply voltages. Further reduction of power
consumption is possible by employing some of the low-power techniques,
such as parallelism, discussed in the above section. This example employs
parallelism to reduce the bit time to a single inverter delay while maintaining
a lower clock rate. The last example details another serial link example that
transmits with a sub-clock-period bit time. The design reduces power by
serializing the transmitted data further upstream to reduce the clock-loading penalty and minimizes receiver offsets through calibration to enable small signal swings. This section is by no means a thorough investigation of each of the examples introduced above. Rather, it highlights the key features of each design and presents some experimental results to demonstrate what is possible.

8.4.1 Supply-Regulated PLL and DLL Design

Clock generation for both the transmitter and receiver is a critical component
that sets the performance of high-speed links. The study and implementation of PLLs and DLLs have been extensive over the past few decades, with special attention placed on minimizing jitter. As mentioned earlier, the VCO in a PLL is especially sensitive to noise; this led to the development of the self-biased differential delay elements of Maneatis [48], which have good power-supply noise rejection properties.
approach to building PLLs and DLLs with good noise rejection properties
has emerged [47]. This approach relies on a linear regulator to drive simple
delay elements comprised of inverters. The delay of these inverters is
controlled directly through their supply voltage instead of modulating
current or capacitive loading. Enabling high power-supply rejection at the
output of the regulator isolates the control node from noise on the power
supply lines. In addition to low jitter characteristics, this approach eliminates the static current of conventional delay elements, also enabling lower power operation. This
section highlights the particular challenges that supply-regulated delay
elements present to the design of PLLs and DLLs. Implementation details of
a linear regulator and charge pump that are common to both PLL and DLL designs are described, showing how one can build low-jitter loops whose power consumption and bandwidth track with frequency.

8.4.1.1 DLL

In order to build PLLs and DLLs with robust operation over a wide range of
frequencies, one would like to have their bandwidths track the operating
frequency. Then, the loop parameters can be optimized to the lowest jitter
settings [22]. Taking a look at the stability requirements for each loop
elucidates some of the challenges of using supply-regulated inverters as
delay elements. The transfer function of a DLL can be modeled with a single dominant pole as:

    H(s) = 1 / (1 + s/ω_N)    (8.5)

where ω_N represents the dominant pole frequency (also equivalent to the loop bandwidth). Ideally, ω_N should track ω_ref, with the loop bandwidth always 10-20x lower than the operating frequency, so that the fixed delay around the loop results in a small negative phase shift. ω_N can be modeled by the following equation:

    ω_N = I_CP · K_DL · f_ref / C    (8.6)

where I_CP is the charge-pump current, C is the loop-filter capacitor, K_DL is the delay-line gain, and f_ref is the input frequency. ω_N will track f_ref if I_CP and K_DL are constant with frequency. Unfortunately, K_DL is not constant with frequency since the delay of an inverter is not linear with voltage. Since C is nominally fixed, the charge pump ought to be designed so that I_CP tracks 1/K_DL in order to build a robust DLL.
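The tracking condition can be checked numerically. The sketch below evaluates the bandwidth relation of equation (8.6) with assumed, purely illustrative component values, and confirms that when the product I_CP·K_DL is held constant the ratio of loop bandwidth to input frequency does not move with operating frequency.

```python
# Numeric sketch of the DLL bandwidth relation w_N = I_CP * K_DL * f_ref / C
# (symbols as in the text). The component values are illustrative
# assumptions; the point is only that w_N tracks f_ref when the product
# I_CP * K_DL is held constant.

C_LF = 10e-12    # loop-filter capacitance (F), assumed
I_CP = 10e-6     # charge-pump current (A), assumed
K_DL = 5e-9      # delay-line gain (s/V), assumed

def dll_bandwidth(i_cp, k_dl, f_ref, c_lf=C_LF):
    """Dominant-pole frequency of the DLL, per eq. (8.6)."""
    return i_cp * k_dl * f_ref / c_lf

ratios = [dll_bandwidth(I_CP, K_DL, f) / f for f in (100e6, 200e6, 400e6)]
print(ratios)    # identical ratios: the bandwidth tracks f_ref
```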

In addition to the stability constraints described above, both current and voltage must be delivered to the supply-regulated inverters. Hence, both designs require a linear regulator that buffers the control voltage on the loop filter and drives the inverters. A block diagram of the regulated-supply buffers and a two-stage current-mirror-based regulator design are presented in Figure 8.13. By keeping the inter-stage mirroring ratio low, the amplifier is virtually a single-pole system and does not require stabilizing compensation. A current mirror driven by the loop control voltage sets the differential-pair bias current. In a unity-gain configuration, the transconductance of the amplifier is simply that of its input stage multiplied by the mirroring ratio. Since the transconductance of the first stage is a function of the bias current set by the control voltage, the bandwidth of the regulator tracks with operating frequency and does not compromise the enclosing PLL/DLL stability even with variations in process and operating environment. Furthermore, the operating current of the amplifier also scales with frequency.

A charge pump design where I_CP is also a function of the control voltage V_ctl for the loop is presented in Figure 8.14. Utilizing a long-channel device in the current source yields the following relationship between current and voltage:

    I_CP = k_CP (V_ctl − V_T)²    (8.7)

For a DLL, the delay-line's delay can be modeled by the following expression:

    t_DL = N · C_b · V_ctl / (k_DL' (V_ctl − V_T)^α)    (8.8)

where N is the number of stages in the delay line, and C_b is the capacitive load seen by each delay stage. Taking the derivative with respect to V_ctl yields the following expression for delay-line gain:

    K_DL = |d t_DL / d V_ctl| = (N · C_b / k_DL') · ((α − 1) V_ctl + V_T) / (V_ctl − V_T)^(α+1)    (8.9)

where α can vary from 1 to 2. Plugging equations (8.7) and (8.9) into equation (8.6) yields a ratio between ω_N and ω_ref:

    ω_N / ω_ref = (1/2π) · (k_CP / k_DL') · (N · C_b / C) · ((α − 1) V_ctl + V_T) / (V_ctl − V_T)^(α−1)    (8.10)

where (α − 1) is small for modern short-channel devices. Therefore, the ratio is nominally fixed as a ratio between two capacitances, N·C_b and C, whose values ought to track each other over voltage. The resulting DLL design's delay line consists of six inverter stages locked to 180° of the input reference clock signal. The linear regulator in a unity-gain configuration replicates the control voltage on the loop filter and drives the delay elements. The DLL's bandwidth tracks the operating frequency, and the current consumption in the loop also scales to enable lower power consumption at lower frequencies.
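A quick numerical check of this compensation is possible: with a square-law charge-pump current and a power-law inverter drive current (exponent α), the product I_CP·K_DL, which sets the bandwidth ratio, stays nearly constant across control voltage. All device values below are illustrative assumptions.

```python
# Numeric check, with assumed device values, that the square-law charge
# pump (eq. 8.7) compensates the voltage-dependent delay-line gain
# (eq. 8.9): their product, which sets w_N / w_ref, stays roughly
# constant as the control voltage varies.

VT = 0.45                      # threshold voltage (V), assumed
ALPHA = 1.1                    # drive-current exponent, ~1 for short channel
K_CP, K_INV = 40e-6, 200e-6    # device-strength factors, assumed
N, CB = 6, 20e-15              # delay-line stages and per-stage load, assumed

def i_cp(v):
    """Charge-pump current from a long-channel (square-law) source."""
    return K_CP * (v - VT) ** 2

def k_dl(v):
    """|d t_DL / d V_ctl| for a supply-regulated inverter delay line."""
    return (N * CB / K_INV) * ((ALPHA - 1) * v + VT) / (v - VT) ** (ALPHA + 1)

products = [i_cp(v) * k_dl(v) for v in (0.9, 1.2, 1.5, 1.8)]
spread = max(products) / min(products)
print(spread < 1.1)    # True: w_N / w_ref is nominally fixed over V_ctl
```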

8.4.1.2 PLL design

Due to phase integration in the VCO, a PLL is at least a second-order system and necessitates a zero for stability. The zero is often implemented with a resistor in the loop filter, and the loop can be modeled by the following transfer function:

    H(s) = (2ζω_n s + ω_n²) / (s² + 2ζω_n s + ω_n²)    (8.11)

where bandwidth ω_n and damping factor ζ are given by:

    ω_n = sqrt(I_CP · K_VCO / C),    ζ = (R/2) · sqrt(I_CP · K_VCO · C)    (8.12)

I_CP is the charge-pump current, K_VCO is the VCO gain (which is roughly constant), R is the loop-filter resistor, and C is the loop-filter capacitor. In order to achieve a wide lock range with maximum bandwidth, ω_n must track the operating frequency while keeping ζ constant. Simply adjusting I_CP so that ω_n tracks frequency (as in the case for a DLL) will compromise loop stability by overly reducing ζ at lower frequencies. Instead, both I_CP and R should be varied such that ζ remains constant over the operating frequency range.
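This constraint can be seen with a short numeric sketch using the standard second-order charge-pump PLL relations ω_n = sqrt(I_CP·K_VCO/C) and ζ = (R/2)·sqrt(I_CP·K_VCO·C); the component values are illustrative assumptions.

```python
# Sketch of the PLL scaling constraint with illustrative values: shrinking
# the charge-pump current alone lowers the damping factor, while also
# scaling the effective loop resistance restores zeta as w_n tracks down.
import math

K_VCO = 2 * math.pi * 500e6    # VCO gain (rad/s per volt), assumed
C_LF = 20e-12                  # loop-filter capacitor (F), assumed

def loop(i_cp, r):
    """Second-order charge-pump PLL bandwidth and damping factor."""
    w_n = math.sqrt(i_cp * K_VCO / C_LF)
    zeta = (r / 2) * math.sqrt(i_cp * K_VCO * C_LF)
    return w_n, zeta

w1, z1 = loop(20e-6, 5e3)
w2, z2 = loop(5e-6, 5e3)       # I_CP/4, R fixed: w_n halves, zeta halves too
w3, z3 = loop(5e-6, 10e3)      # I_CP/4, R doubled: zeta is restored
print(round(w2 / w1, 3), round(z2 / z1, 3), round(z3 / z1, 3))
```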
Utilizing the charge pump and linear regulator designs described above also yields a PLL design that meets stability constraints over a wide frequency range. In order to satisfy the constraint that ζ be constant with frequency, the resistor can be implemented with active components. In a conventional design, the control voltage is a combination of the aggregate charge stored on the loop-filter capacitor plus the instantaneous voltage across the filter resistor. This is analogous to an implementation where the voltage on the capacitor is buffered through a unity-gain amplifier and then augmented by the instantaneous voltage formed by a second charge pump and the amplifier's output impedance [48]. Now, simply changing the second charge-pump's current varies the effective loop resistance. The resulting loop configuration is shown in Figure 8.15. The VCO consists of five inverter buffers in a ring, and an amplifier converts the VCO output to full CMOS levels to drive the phase-frequency detector (PFD). The output of the PFD drives two charge pumps. [47] shows that the resulting loop has a bandwidth ratio ω_n/ω_ref and damping factor ζ governed by nominally fixed ratios involving the charge-pump currents and the capacitances C_b and C, where C_b is again the capacitive load of each buffer stage. Hence, robust operation is possible over a wide frequency range by keeping ω_n/ω_ref and ζ nominally fixed, and this scheme enables the optimal scaling of loop dynamics to minimize jitter. Like the DLL, the current consumption of the loop components tracks with operating frequency to enable lower power consumption when operating at lower frequencies.

PLL and DLL designs utilizing supply-regulated delay elements offer robust operation with the ability to scale their overall power consumption with respect to the frequency of operation. Variations of these loops are used in the subsequent sections for clock generation and timing recovery to build energy-efficient links. The next example extends the idea of regulating the supply voltage beyond the delay elements to drive the entire serial-link interface.

8.4.2 Adaptive-Supply Serial Links

Two examples of adaptive power-supply regulation applied to high-speed interfaces can be found in the literature. A detailed description of a source-synchronous parallel I/O that leverages the energy savings possible by reducing the supply voltage along with clock rate is presented in [13]. In
addition to trading performance for energy reduction, the design obviates the
need for additional adjustment circuitry that scales certain properties of the
link proportionally with bit rate. For example, the loop bandwidths of PLLs
and DLLs, the slew rate of transmitters, and the bandwidth of the receiver all
track bit rate by exploiting the adaptive supply as a global bias voltage. Kim
and Horowitz extend this adaptive power-supply regulation technique to
serial links and solve several limitations that plague link circuitry operating
at low voltage levels. This section discusses some of the implementation
details of adaptive supply serial links fully described in [16].
Serial links typically transmit more bits per clock period by exploiting
parallelism in the transmitter and receiver and utilize multi-phase clocks to
provide fine timing information at lower frequencies [34]. As discussed in
Section 3.2, transmitting multiple bits within a clock period not only
improves the performance but also leads to further reduction in power
consumption if the supply voltage is dynamically scaled. At a given bit rate,
the power of the adaptive-supply serial link drops quadratically with the multiplexing rate, but at the cost of increased area. Thus, the combination of
parallelism and an adaptive supply allows one to trade area instead of
performance for lower power dissipation.
This section describes the various components required in a supply-
regulated serial link design with sub-1V operation. Several challenges are
present as higher multiplexing rates are pursued. First, generating a large
number of clock phases at one place and then distributing them to multiple
I/O circuits becomes extremely difficult because strict matching between
clock paths is required to minimize static phase offsets. Instead, one can
generate the multi-phase clocks locally at each I/O circuit by exploiting the
coarse tuning voltage distributed by a global control loop. With adaptive
power-supply regulation, the adaptive power-supply regulator serves as this
global control loop, and the adaptive supply-voltage acts as the coarse tuning
voltage for the local clock generators. The clock generation and distribution
details are discussed in Section 4.2.1. The second challenge is to design I/O
circuits that can operate at low supply voltages. In order to achieve power
reduction with parallelism, one assumes that performance of I/O circuitry
scales with the supply voltage and tracks the delay of an inverter [13]. Most
conventional I/O circuits face problems as the supply voltage approaches the
threshold voltage of transistors since they are not purely comprised of digital
logic gates but contain some analog circuitry. Section 4.2.2 describes
techniques to extend the lower limit of the supply range. Lastly, to
reduce pin count, typical serial links do not send clock information along
with the data, so the receive side of the transceiver must recover timing
information from the data stream. Both PLL and DLL implementations are
possible, but in the case of multi-phase clock recovery, this example
demonstrates that a PLL is more power and area efficient. The rest of this
section describes the implementation of the clock recovery block.

8.4.2.1 Multi-phase Clock Generation

Figure 8.16 illustrates the block diagram of multiple serial links with an
adaptive power-supply regulator and local clock generators. The adaptive
power-supply regulator adjusts the supply voltage using digital sliding
control [46] so that the reference VCO oscillates at the desired operating frequency f_ref. Sliding control is a nonlinear control mechanism widely used in switching supplies and has superior stability and transient response compared to linear control [49]. Digital implementation of the sliding control has the
benefit that the critical path delay of the controller scales with the reference
frequency. Most linear and PWM-based controllers do not have this property
since the controller must operate at a fixed frequency [50]. This benefit
allows the digital sliding controller to operate off of the variable regulated
supply. Therefore, the power of the controller itself scales with the load
power, and the controller power overhead remains a constant portion of the
total power over a wide operation range. Implementation details and design
analysis of this power supply regulator are explained in [46].
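The control idea can be sketched as a toy bang-bang loop: each cycle, the controller nudges the supply up or down depending on whether the reference VCO runs slower or faster than the target. The linear VCO model and all constants below are illustrative assumptions, not the design in [46].

```python
# Toy model of digital sliding (bang-bang) supply control: each cycle the
# controller nudges Vdd up or down depending on whether a reference VCO,
# modeled here as f = G*(Vdd - VT), runs slower or faster than the target.
# The VCO model and all constants are illustrative assumptions.

VT, G = 0.4, 1.0e9     # VCO model parameters (V, Hz/V), assumed
STEP = 0.005           # regulator voltage step per control cycle (V)

def regulate(f_target, vdd=1.0, cycles=400):
    """Slide Vdd toward the minimum supply that meets f_target."""
    for _ in range(cycles):
        f_vco = G * (vdd - VT)
        vdd += STEP if f_vco < f_target else -STEP
    return vdd

vdd = regulate(300e6)
print(vdd)   # settles near the 0.7 V minimum supply for 300 MHz (toy model)
```

In steady state the loop dithers by one step around the minimum supply, which is the same behavior that bounds the regulator's residual voltage ripple.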
The VCOs of local clock generator PLLs are identical to the reference
VCO inside the adaptive power-supply regulator. Thus, when the VCOs
operate at the adaptive supply, they are expected to oscillate at frequencies
very close to f_ref. In this way, the adaptive supply acts as a coarse tuning voltage for the PLLs. The feedback adjustment of the local PLLs only needs
to compensate for within-die variation and thus can have a narrow tuning
range (+/-15%). This eases the design of the PLL since noise on the control
voltage has less impact on the VCO clock jitter.

8.4.2.2 Low-voltage Transmitter and Receiver

As the bit period reduces to less than two gate delays, it is difficult to multiplex data onto an on-chip high-impedance node (although it can be more power and area efficient [19]). Therefore, for high orders of multiplexing, the off-chip low-impedance node of the transmitter output is chosen. pMOS drivers
transmit the signal referenced to ground since the supply voltage is
adaptively adjusted relative to the chip’s process and operating conditions
and thus can no longer be a common reference across different chips. Each
driver consists of two transistors in series, and each drives the output for a
bit period, which is defined by the overlap of two clocks with adjacent
phases [36]. Predrivers qualify the clocks depending on the data being
transmitted.
This conventional transmitter encounters a couple of problems as the supply voltage reduces. First, the output swing drops rapidly as the supply voltage approaches the threshold voltage, since the current of the pMOS driver scales as (V_DD − |V_Tp|)². Second, the output pulse width narrows as the supply voltage drops due to the threshold-voltage-dependent switching points. Both of these problems are related to the threshold voltage of the pMOS driver and are limitations of the design in [13]; they can be mitigated by making the threshold voltage effectively zero. The transmitter illustrated in Figure 8.17 uses a level-shifting predriver that shifts its output voltage level down by a threshold voltage, so the gate voltage of the pMOS driver swings between V_DD − |V_Tp| and −|V_Tp|. Transistors M1 and M2 suppress leakage currents when the pMOS drivers are barely off with the gate voltages at V_DD − |V_Tp|. The gate overdrive of the pMOS driver is now V_DD instead of V_DD − |V_Tp|, and so the output swing scales as V_DD², with reasonable output swings even at low supplies. The switching points are now independent of the supply, and the output pulse width stays relatively constant across variable supply-voltage levels.
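The benefit of recovering the lost threshold can be quantified with a simple square-law sketch; the threshold value and current model are illustrative simplifications, not the chapter's device parameters.

```python
# Sketch comparing pMOS driver overdrive with and without the level-
# shifting predriver: conventional gate drive loses a threshold (overdrive
# Vdd - |Vt|), while the shifted drive recovers the full Vdd. The square-
# law current model and threshold value are illustrative assumptions.

VT = 0.45    # pMOS threshold-voltage magnitude (V), assumed

def drive_current(overdrive, k=1.0):
    """Square-law drain current for a given gate overdrive."""
    return k * max(overdrive, 0.0) ** 2

for vdd in (1.8, 0.9, 0.6):
    conv = drive_current(vdd - VT)        # conventional predriver
    shifted = drive_current(vdd)          # level-shifted predriver
    print(vdd, round(conv / shifted, 2))  # ratio collapses as Vdd -> |Vt|
```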
At the receiving end, parallel sets of current-integrating receivers de-
multiplex the data stream. The receiving window of each receiver is defined
by a set of multi-phase clocks. For high-frequency noise rejection, each
front-end of the receiver integrates the signal during a bit period [18]. This
implementation avoids voltage headroom issues associated with low-voltage
operation by boosting the supply voltage to the integrating receiver circuitry
and eliminating the sample and hold circuitry found in the original design.
Lastly, a modified comparator circuit that can operate at sub-1V supply levels amplifies the integrated signal to full logic levels.
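The noise-rejection property of the integrating front end is easy to see in a toy model: integrating over a full bit period and slicing the sign averages a high-frequency disturbance to roughly zero. The waveform below is purely illustrative.

```python
# Sketch of the current-integrating idea described above: integrating the
# differential input over a full bit period and slicing the sign averages
# out high-frequency noise. The waveform below is purely illustrative.
import math

def integrate_and_slice(samples):
    """Detected bit = sign of the integral over one bit period."""
    return 1 if sum(samples) > 0 else 0

# A +20 mV symbol corrupted by a large high-frequency tone still resolves
# correctly, because the tone integrates to ~zero over the bit period.
n = 64
rx = [0.020 + 0.100 * math.sin(2 * math.pi * 8 * k / n) for k in range(n)]
print(integrate_and_slice(rx))   # 1
```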

8.4.2.3 Clock-recovery PLL

In the absence of a dedicated parallel clock signal, each serial link must
recover timing information from the data stream. Figure 8.18 illustrates the
clock-recovery PLL implemented. A duplicate set of data receivers sampling
the edges instead of the center of the data eye enables phase detection but
provides only binary information on the phase. Hence, PLLs with binary
phase-detectors are bang-bang controlled [51], and they must have low loop
bandwidths to minimize dither jitter when locked. This low bandwidth
results in a very narrow frequency range (+/-2%) over which the PLL can lock. Thus, a frequency-acquisition aid is necessary to guide the VCO
frequency to fall within the lock-in range. However, since the VCO
frequency is coarsely tuned by the adaptive supply and already close to the
desired frequency, simple frequency sweeping can be used [52]. During
preamble mode, the transmitter sends a full-transition signal (10101010 bit
pattern), and the receiver can detect cycle slipping when consecutive 1’s and
0’s are received. The frequency sweeper initializes the control voltage
to the highest value and then steps it down whenever cycle slipping is
detected. As the loop approaches lock, cycle slipping happens less
frequently, and the phase-acquisition loop finally pulls the loop into lock.
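The sweep described above can be sketched as a simple loop: starting from the maximum control voltage, step down on every detected cycle slip until the frequency error falls inside the bang-bang lock-in range. The VCO model and step sizes below are illustrative assumptions.

```python
# Toy sketch of the frequency-sweeping acquisition aid: during the 1010...
# preamble, the sweeper starts the VCO control voltage at its maximum and
# steps it down on every detected cycle slip until the frequency error
# falls inside the narrow bang-bang lock-in range. Numbers are illustrative.

LOCK_IN = 0.02                     # +/-2% lock-in range from the text

def sweep(f_target, f_vco, v_ctl=1.0, step=0.02):
    """Step v_ctl down (one step per cycle slip) until within lock-in."""
    while abs(f_vco(v_ctl) - f_target) / f_target > LOCK_IN:
        v_ctl -= step              # a cycle slip was observed
    return v_ctl

# Hypothetical VCO: coarsely tuned by the adaptive supply, so its fine
# tuning range around the target is only +/-15%.
f_vco = lambda v: 1.0e9 * (1 + 0.15 * (v - 0.5))
v_lock = sweep(1.0e9, f_vco)
print(abs(f_vco(v_lock) - 1.0e9) / 1.0e9 <= LOCK_IN)   # True
```

The sweep converges quickly precisely because the adaptive supply has already placed the VCO within its narrow fine-tuning range.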
Although DLL-based per-pin clock recovery is also possible, it requires
additional overhead to generate multiple phases of the clock (either multiple
mux/interpolator paths or a 180-degree delay line) [24][13]. PLL-based clock
recovery circuits can generate multiphase clocks from the VCOs and,
therefore, are more power and area efficient compared to their DLL
counterparts.

8.4.3 Low-Power Area-Efficient High-Speed I/O Circuit Techniques

In addition to adaptive power-supply regulation, other examples of low-power I/O circuit techniques to implement high-speed serial links can be found in the literature. Lee, et al. investigate three techniques to achieve
small area and low power in [19] and [53] and demonstrate a high-speed
link, implemented in a CMOS technology, that operates at 4-Gb/s
while dissipating 127mW. This link design example also multiplexes several
bits within a clock period to achieve high bit rates, but instead of
multiplexing at the transmitter output, multiplexing is performed further
back in the transmit path in order to reduce clock energy. In order to attain
the speed necessary in the circuitry following the mux point, lower voltage
swings in the signal paths are used. The design also implements a DLL with
supply-regulated inverters to generate low-jitter clocks while reducing power
consumption. Clock recovery is achieved with a dual-loop design similar to
the design described in Section 2.4. Lastly, a capacitively-trimmed receiver
enables reliable operation at very low signal levels by compensating for
device offsets. Since the DLL design used for clock generation is similar to
the supply-regulated designs previously described in this section, the design
of the transmitter and receiver will be the focus here.

8.4.3.1 Transmitter

While the shortest clock period for a technology is limited by the
requirements for distributing the clock signal without attenuation, higher bit
rates can again be achieved through parallelism by multiplexing several bits
within a single clock period. The input-multiplexed transmitter employed is
illustrated in Figure 8.19. It consists of a 4:1 multiplexer, a pre-amplifier,
and an output driver. Differential signaling is possible with dual pseudo-
nMOS multiplexers that generate complementary symbols. Four parallel sets
of series nMOS pull-down networks gate the data bits with different clock
phases to drive the pre-amplifier with symbols for each side at a rate four
times greater than the base clock rate. The minimum symbol time is limited
to two FO4 inverter delays to avoid significant amplitude attenuation, which
could lead to significant ISI within the transmitter drivers. A fully
differential output driver can be driven with the low-swing signals out of the
pre-amplifier. Tight control over this pre-amplifier’s output swing
guarantees that the bandwidths required for the high rates are achieved.
The differential link is doubly terminated and utilizes a two-tap FIR pre-
emphasis filter to combat channel non-idealities. Both the transmitter and
receiver are terminated with pMOS resistors tuned via 18 bits of
thermometer-coded control. In order to keep the pMOS devices in the linear
region, the driver’s output swing must be constrained. In this
implementation, the output swing is limited to no greater than 200mV to
limit resistance variation to within 10%. The FIR filter is implemented as a
2-bit DAC by summing two legs of transmitter drivers to the driver output
and controlling their bias currents to generate the filter coefficients
appropriate for the channel.
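A two-tap pre-emphasis filter of this kind is simple to express in signal terms: the main cursor is summed with a scaled, inverted copy of the previous symbol, boosting transitions and de-emphasizing repeated bits. The tap weight below is an illustrative assumption, not the chapter's coefficient.

```python
# Sketch of a two-tap FIR pre-emphasis filter of the kind described above:
# the main cursor is summed with a scaled, inverted copy of the previous
# symbol, boosting transitions and de-emphasizing repeated bits. The tap
# weight a1 is an illustrative assumption, not the chapter's coefficient.

def pre_emphasis(bits, a1=0.25):
    """y[n] = x[n] - a1*x[n-1] with x in {-1,+1} (differential symbols)."""
    x = [2 * b - 1 for b in bits]                    # map 0/1 -> -1/+1
    return [x[n] - a1 * (x[n - 1] if n else 0) for n in range(len(x))]

print(pre_emphasis([1, 1, 0, 0, 1]))
# [1.0, 0.75, -1.25, -0.75, 1.25]: transitions boosted, repeats attenuated
```

In hardware the two taps become two driver legs summed at the output pad, with the coefficient a1 set by the leg's bias current.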

8.4.3.2 Receiver

Four parallel sets of receivers de-multiplex the incoming signal as shown in
Figure 8.20. The receiver’s front-end amplifier is a modified StrongArm
sense amplifier with trimming capacitors at the output to compensate for
offset voltage. Parallel pMOS capacitors enable 4-bits of resolution on nodes
a and b. Results show that this scheme can reduce up to 120mV of offset
down to 8mV. Reducing this offset enables reception of smaller signal
swings, which leads to lower overall power consumption. Experimental
results of the transceiver implementation show that swing levels less than
10mV and larger timing margins are possible with offset calibration. The RS
latch following the receiver front-end amplifier holds the data for subsequent
use.
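The offset-trimming step can be sketched as a binary search over the trim code with the receiver inputs shorted, leaving a residual of at most one trim step (8mV here, matching the resolution quoted above). The comparator model is an illustrative assumption.

```python
# Toy sketch of the offset-trimming idea: with its inputs shorted, the
# receiver's 4-bit trim DAC is binary-searched until the comparator's
# residual offset changes sign. The comparator model and step size are
# illustrative assumptions consistent with the ~120 mV / 8 mV figures above.

TRIM_STEP = 8e-3        # offset correction per trim code (V)

def calibrate(raw_offset, bits=4):
    """Binary-search the trim code that best cancels raw_offset (>= 0)."""
    code = 0
    for b in reversed(range(bits)):
        trial = code | (1 << b)
        # Shorted-input comparator reports the sign of the residual offset.
        if raw_offset - trial * TRIM_STEP > 0:
            code = trial
    return code

code = calibrate(0.090)                    # 90 mV raw offset
residual = 0.090 - code * TRIM_STEP
print(code, abs(residual) <= TRIM_STEP)    # 11 True
```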

8.4.4 Putting It Together

This section has described four implementation examples consisting of
energy-efficient serial links and timing generator building blocks. Energy
savings are possible when the link components run at the minimum voltage
necessary for the operating frequency or bit rate [13]. In addition to
dynamically scaling the voltage, parallelism offers further energy savings.
The adaptive-supply serial link demonstrates that by enabling bit times that
are at a fraction of the clock cycle time, high performance can be achieved
while running the rest of the clock distribution and digital circuitry at a lower
frequency and voltage [16]. Lee, et al. also demonstrate that low-swing
transmitter predrivers can achieve the speeds necessary for high bit rates and
reduce clock loading by performing the multiplexing function further up in
the transmit datapath [19]. Lastly, reducing the transmitted swing magnitude
reduces the overall power consumption of high-impedance drivers [53]. By
combining the various techniques employed by each of these examples,
energy-efficient link designs are achievable in modern digital and
communication systems that require high bandwidth communication
between chips.

8.5 SUMMARY

The advancements in CMOS technology have brought about a significant
increase in the performance and functionality possible with large digital
systems. Intra-chip communication and clock speeds have been tracking
with technology scaling as devices get faster.7 Unfortunately, package and
channel mediums for inter-chip communications have not advanced at the
same rate. Therefore, high-speed signaling techniques were developed to
alleviate the communication bandwidth bottleneck. As in digital systems, where the pursuit of low-power, energy-efficient design has become just as significant as the pursuit of raw speed and performance, designers are looking for new and innovative ways to build energy-efficient
links. This chapter has provided a brief overview of link design and
presented various trade-offs and techniques for energy-efficient operation.
Further research and development in low-power circuit techniques,
packaging, and interconnect technology should continue to improve energy
efficiency of links. However, one can again learn from digital systems
designers who have been able to find lower-power implementations by re-
visiting the system from an architectural and algorithmic level [54].
Similarly, investigating alternative modulation schemes and communication
methods may offer other innovative energy-efficient link solutions.

REFERENCES
[1] G. Besten, “Embedded low-cost 1.2Gb/s inter-IC serial data link in 0.35um CMOS,”
IEEE Int’l Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 2000, pp. 250-251.
[2] M. Fukaishi et al, “A 20Gb/s CMOS multi-channel transmitter and receiver chip set for
ultra-high resolution digital display,” IEEE Int’l Solid-State Circuits Conf. Dig. Tech.
Papers, Feb 2000, pp. 260-261.
[3] S. Sidiropoulos et al, “A CMOS 500Mbps/pin synchronous point to point interface,”
IEEE Symposium on VLSI Circuits, June 1994.
[4] T. Tanahashi et al, “A 2Gb/s 21CH low-latency transceiver circuit for inter-processor
communication,” IEEE Int’l Solid-State Circuits Conference Dig. Tech. Papers, Feb.
2001, pp. 60-61.
[5] P. Galloway et al, “Using creative silicon technology to extend the useful life of backplane and card substrates at 3.125 Gbps and Beyond,” High-Performance System Design Conference, 2001.

7. Of course, one cannot ignore the effects of wire parasitics, which do not scale quite as nicely and are now what limit high-speed digital circuit performance [55].
[6] R. Gu et al, “A 0.5-3.5 Gb/s low-power low-jitter serial data CMOS transceiver,” IEEE
Int’l Solid-State Circuits Conf. Dig. Tech. Papers, Feb 1999, pp. 352-353.
[7] J. Sonntag et al, “An adaptive PAM-4 5 Gb/s backplane transceiver in 0.25um CMOS,”
IEEE Custom Integrated Circuits Conference, to be published 2002.
[8] Y.M. Greshishchev et al, “A fully integrated SiGe receiver IC for 10Gb/s data rate,”
IEEE Int’l Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 2000, pp. 52-53.
[9] J.P. Mattia et al,“A 1:4 demultiplexer for 40Gb/s fiber-optic applications,” IEEE Int’l
Solid-State Circuits Conf. Dig, Tech. Papers, Feb. 2000, pp. 64-65.
[10] Reese et al, “A phase-tolerant 3.8 GB/s data-communication router for multi-processor supercomputer backplane,” IEEE Int’l Solid-State Circuits Conf. Dig. Tech. Papers, pp.
296-297, Feb. 1994.
[11] E. Yeung et al, “A 2.4Gb/s/pin simultaneous bidirectional parallel link with per pin skew
compensation ,” IEEE Int’l Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 2000, pp.
256-257.
[12] J. Proakis, M. Salehi, Communications Systems Engineering, Prentice Hall, New Jersey,
1994.
[13] G. Wei et al, “A variable-frequency parallel I/O interface with adaptive power-supply
regulation,” IEEE Journal of Solid-State Circuits, vol. 35, no. 11, Nov. 2000, pp. 1600-
1610.
[14] B. Lau et al, “A 2.6Gb/s multi-purpose chip to chip interface,” IEEE Int’l Solid-State
Circuits Conf. Dig. Tech. Papers, Feb 1998, pp. 162-163.
[15] A. DeHon et al, “Automatic impedance control,” 1993 IEEE Int’l Solid-State Circuits
Conf. Dig. Tech. Papers, pp. 164-5, Feb. 1993.
[16] J. Kim et al, “Adaptive supply serial links with sub-1V operation and per-pin clock
recovery,” IEEE Int’l Solid-State Circuits Conf. Dig. Tech. Papers, Feb 2002.
[17] K. Donnelly et al, “A 660 MB/s interface megacell portable circuit in 0.3um-0.7um
CMOS ASIC,” IEEE Int’l Solid-State Circuits Conf. Dig. Tech. Papers, pp. 290-291,
Feb 1996.
[18] S. Sidiropoulos et al, “A 700-Mb/s/pin CMOS signalling interface using current
integrating receivers,” IEEE Journal of Solid-State Circuits, May 1997, pp. 681-690.
[19] M. -J. E. Lee et al, “Low-power area efficient high speed I/O circuit techniques,” IEEE
Journal of Solid-State Circuits, vol. 35, Nov. 2000, pp. 1591-1599.
[20] F.M. Gardner, “Charge-pump phase-lock loops,” IEEE Transactions on
Communications, vol. 28, no. 11, Nov. 1980, pp. 1849-1858.
[21] M. Johnson, “A variable delay line PLL for CPU-coprocessor synchronization,” IEEE
Journal of Solid-State Circuits, vol. 23, no. 5, Oct. 1988, pp. 1218-1223.
[22] M. Mansuri et al, “Jitter optimization based on phase-locked-loop design parameters,”
IEEE Int’l Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 2002.
[23] M. Horowitz et al, “High-speed electrical signalling: Overview and limitations,” IEEE
Micro, vol. 18, no. 1, Jan.-Feb. 1998, pp.12-24.
[24] S. Sidiropoulos and M. Horowitz, “A semi-digital dual delay-locked loop,” IEEE
Journal of Solid-State Circuits, Nov. 1997, pp. 1683-1692.
[25] K. -Y. K. Chang et al, “A 0.4-4Gb/s CMOS quad transceiver cell using on-chip regulated
dual-loop PLLs,” IEEE Symposium on VLSI Circuits, accepted for publication June
2002.
[26] W.J. Dally et al. Digital Systems Engineering, Cambridge University Press, 1998.
[27] W. J. Dally et al, “Transmitter equalization for 4-Gbps signalling” IEEE Micro, Jan.-
Feb. 1997. vol. 17, no. 1, pp. 48-56.
238 Energy-Efficient Design of High-Speed Links

[28] R. Farjad-Rad et al, CMOS 8-GS/s 4-PAM Serial Link Transceiver,” IEEE
Symposium on VLSI Circuits Dig. Tech. Papers, pp.41-44.
[29] A. Fieldler et al, “A 1.0625 Gbps transceiver with 2X oversampling and transmit pre-
emphasis,” IEEE Int’l Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 1997, pp. 238-
239.
[30] A.P. Chandrakasan et al, Low Power Digital CMOS Design. Norwell, MA: Kluwer
Academic, 1995.
[31] D. Dobberpuhl, “The design of a high performance low power microprocessor,” IEEE
Int’l Symposium on Low Power Electronics and Design Dig. Tech. Papers, Aug. 1996,
pp. 11-16.
[32] M. Horowitz, “Low power processor design using self-clocking,” Workshop on Low-
Power Electronics, 1993.
[33] Zerbe et al, “A 2Gb/s/pin 4-PAM parallel bus interface with transmit crosstalk
cancellation, equalization, and integrating receivers,” IEEE Int’l Solid-State Circuits
Conf. Dig. Tech. Papers, Feb. 2001, pp. 66-67.
[34] C. -K. Yang, “Design of high-speed serial links in CMOS,” Ph.D. dissertation, Stanford
University, Stanford, CA, Decemeber 1998.
[35] D. Weinlader et al, “An eight channel 36Gample/s CMOS timing analyzer,” IEEE Int’l
Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 2000, pp. 170-171.
[36] K. Yang, “A scalable 32Gb/s parallel data transceiver with on-chip timing calibration
circuits,” IEEE Int’l Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 2000, pp. 258-
259.
[37] H. Johnson, “Multi-level signaling,” DesignCon, Feb. 2000.
[38] T. Burd et al, “A dynamic voltage scaled microprocessor system,” IEEE Int’l Solid-State
Circuits Conf. Dig. Tech. Papers, Feb. 2000, pp. 294-295.
[39] P. Maken, M. Degrauwe, M. Van Paemel and H. Oguey, “A voltage reduction technique
for digital systems,” IEEE Int’l Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 1990,
pp238-239.
[40] G. Wei et al, “A full-digital, energy-efficient adaptive power supply regulator,” IEEE
Journal of Solid-State Circuits, vol. 34, no. 4, April 1999, pp. 520-528.
[41] A. P. Chandrakasan et al, “Data driven signal processing: An approach for energy
efficient computing,” IEEE Int’l Symposium on Low Power Electronics and Design Dig.
Tech. Papers, Aug. 1996, pp. 347-352.
[42] V. Gutnik et al, An efficient controller for variable supply voltage low power
processing,” IEEE Symposium on VLSI Circuits Dig. Tech. Papers, June 1996, pp. 158-
159.
[43] L. Nielsen et al, “Low-power operation using self-timed circuits and adaptive scaling of
supply voltage,” IEEE Trans. VLSI Systems., vol. 2, pp 391-397, Dec 1994.
[44] A. J. Stratakos, “High-efficiency low-voltage DC-DC conversion for portable
applications,” Ph.D. dissertation, University of California, Berkeley, CA, Dec. 1998.
[45] K. Suzuki et al, “A 300 MIPS/W RISC core processor with variable supply-voltage
scheme in variable threshold-voltage CMOS,” Proceedings of the IEEE Custom
Integrated Circuits Conference, May 1997, pp. 587-590.
[46] J. Kim et al, “A digital adaptive power-supply regulator using sliding control,” IEEE
Symposium on VLSI Circuits Dig. Tech. Papers, June 2001.
[47] S. Sidiropoulos et al, “Adaptive bandwidth DLL’s and PLL’s using regulated-supply
CMOS buffers,” IEEE Symposium on VLSI Circuits Dig. Tech. Papers, June 2000.
[48] J.G. Maneatis, “Low-Jitter process independent DLL and PLL based on self-biased
techniques,” IEEE Journal of Solid-State Circuits, vol. 28, no. 12, Dec. 1993.
Summary 239

[49] F. Bilaovic et al, “Sliding modes in electrical machines control systems,” IEEE Int’l
Symp. on Industrial Electronics Conference Proceedings, 1992, pp. 73-78.
[50] G. Wei et al “A low power switching power supply for self-clocked systems,” IEEE
Symposium on Low Power Electronics, Oct. 1996, pp. 313-317.
[51] R.C. Walker et al “A two-chip 1.5-GBd serial link interface,” IEEE Journal of Solid-
State Circuits, vol. 27, no. 12, Dec. 1992, pp. 1805-1811.
[52] F.M. Gardner, “Frequency granularity in digital phase-lock loops,” IEEE Transactions
on Communications, vol. 44, no. 6, June 1996, pp. 749-758.
[53] M. -J. E. Lee et al, “An 84-mW 4-Gb/s clock and data recovery circuit for serial link
applications,” IEEE Symposium on VLSI Circuits Dig. Tech. Papers, June 2001.
[54] L. Geppert, “Transmeta’s magic show [microprocessor chips],” IEEE Spectrum, vol. 37,
no. 5, May 2000, pp. 26-33.
[55] R. Ho et al, “Interconnect scaling implications for CAD,” IEEE/ACM Int’l Conf.
Computer Aided Design Dig. Tech. Papers, Nov. 1999, pp. 425-429.
[56] P. Larsson, “Measurement and analysis of PLL jitter caused by digital switching noise,”
IEEE Journal of Solid-State Circuits, July 2001, vol. 37, no. 7, pp. 1113-1119.
[57] J.G. Maneatis, “Precise delay generation using coupled oscillators,” Ph.D. dissertation,
Stanford University, Stanford, CA, June 1994.
This page intentionally left blank
Chapter 9
System and Microarchitectural Level Power
Modeling, Optimization, and Their Implications in
Energy Aware Computing

Diana Marculescu and Radu Marculescu


Carnegie Mellon University

Abstract: While it is recognized that power consumption has become the limiting factor
in keeping up with increasing performance trends, static or point solutions for
power reduction are beginning to reach their limits. System level
power/performance design exploration for exposing available trade-offs and
achievable limits for various metrics of interest has become an indispensable
step in the quest for shortening the time-to-market for today’s complex
systems. Of particular interest are fast methods for power and performance
analysis that can guide the design process of portable information systems. At
the same time, support is needed at the microarchitectural level for efficient
design exploration for low power or application-driven fine grain power
management. Energy-aware computing is intended to provide a solution to
how various power-reduction techniques can be used and orchestrated such
that the best performance can be achieved within a given power budget, or the
best power efficiency can be obtained under prescribed performance
constraints. The paradigm of energy-aware computing is intended to fill the
gap between gate/circuit-level and system-level power management techniques
by providing more power-management levels and application-driven
adaptability.

Key words: System-level power modelling, stochastic automata networks,


microarchitecture power modelling, energy-aware computing.

9.1 INTRODUCTION

Power consumption has become the limiting factor not only for portable,
embedded applications but also for high-performance or desktop systems.
While there has been notable growth in the use and application of these
systems, their design process has become increasingly difficult due to the
increasing design complexity and shortening time-to-market. The key factor
242 System and Microarchitectural Level Power Modeling, etc.

in the design process of these systems is the issue of efficient power-


performance estimation that can guide the system designer to make the right
choice among several candidate architectures that can run a set of selected
applications.
In this chapter, a design exploration methodology is presented that is
meant to discover the power/performance trade-offs that are available at both
the system and microarchitectural levels of abstraction. While the
application is the main driver for determining the best architectural choices
and power/performance trade-offs, the mapping process produces a platform
specific workload that characterizes the programmable IP-core(s) in the
system under consideration. Such workloads can be used in the IP-core
refinement step so as to achieve better power efficiency, under given
performance constraints, or better adapt the IP-core resource usage to the
application needs. Having such a methodology available can help the
designer select the right platform starting from a set of target applications
[1]. A platform is a family of heterogeneous architectures (which consist of
both programmable and dedicated components) that satisfy a set of
architectural constraints imposed to allow re-use of hardware and software
components [2]. Although the proposed system-level analysis methodology
is completely general, the initial focus is on portable embedded multimedia
systems (e.g., slim hosts like PDAs, network computers). For these systems,
as opposed to the reactive embedded systems used in safety critical
applications, the average behavior is far more important than the worst-case
behavior.
As important as the system-level design exploration step, the
microarchitectural level presents additional challenges and issues that need
to be addressed [3]. As such, another focus of this chapter is on
microarchitectural power analysis and optimization for high-end processors,
characterized by either multimedia, or more general workloads (such as
SPEC benchmarking). High-end processors (such as superscalar, out-of-
order machines) are analyzed in the context of efficient design exploration
for power-performance trade-off, as well as their potential for application-
driven adaptability for energy-aware computation.

9.2 SYSTEM-LEVEL MODELING AND DESIGN


EXPLORATION

Performance evaluation is a fundamental problem for embedded systems


design. Indeed, embedded systems interact with the outside world, and, in
many cases, their interactions have to satisfy strict timing constraints.
Because of this, most of the research so far has been geared towards worst-
System-level Modeling and Design Exploration 243

case analysis, where the correctness of the system depends not only on the
logical results of computation but also on the time at which the results are
produced [4] [5]. Despite the great potential for embedded system design, the
area of average-case analysis has received little attention [6][7][8][9].
However, the area of average-case analysis is becoming more and more
important, and having abstract representations to provide quantitative
measures of power/performance estimates will play a central part. Tools
based on analytic solutions for application-architecture modeling for
performance evaluation are becoming extremely important due to their
potential to significantly shorten the design cycle and allow a better
exploration of the design space [10]. The methodology presented here
complements the existing results for worst-case time analysis and is distinct
from other approaches for performance analysis based on rate analysis [11],
time separation between events [12], and adaptation process [5]. Along the
same lines of using formal models, Thoen and Catthoor [13] address the
problem of generating embedded software efficiently starting from a model
of the behavior of the system being designed. The mathematical properties
of the models are used to drive the synthesis process, with the main objective
of reaching an optimal solution while guaranteeing strict timing constraints.
On the other hand, the existing tools for high-level performance
modeling that can be used in embedded systems design, like Ptolemy [14]
and POLIS [15], focus on application modeling but do not support explicit
mapping of application models onto models of architectures. These tools
share a simulation-based strategy for performance evaluation so they may
require prohibitively long simulation times on real examples. El Greco [16]
and Cadence VCC [17] provide a simulation environment for modeling and
validating the functionality of complex heterogeneous systems. Finally, the
tools recently proposed in [18] are centered on the idea of platform-based
design. The applications are modeled as Kahn process networks that are
further used to perform performance evaluation via simulation.
In what follows, a methodology for system-level power/performance
analysis based on Stochastic Automata Networks (SANs) [19] is presented.
While the methodology described is completely general, the focus of our
attention is on portable embedded multimedia systems. These systems are
characterized by “soft” real-time constraints, and hence, as opposed to safety
critical systems, the average behavior is far more important than their worst-
case behavior. Moreover, due to data dependencies, their computational
requirements show such a large spectrum of statistical variations that
designing them based on the worst-case behavior (typically, orders of
magnitude larger than the actual execution time [5]) would result in
completely inefficient systems.
244 System and Microarchitectural Level Power Modeling, etc.

Typically, the design of heterogeneous architectures follows the Y-chart


scheme [18]. In this scheme, the system designer first characterizes the set of
applications and chooses a possible collection of candidate architectures to
run that set. The application is mapped onto the architectural components
and the performance of the system is then evaluated. Based on the resulting
performance numbers, one may decide to choose a particular architecture.
Otherwise the designer may restructure the application or modify the
mapping of the application to attain better performance numbers.

Relying upon the Y-chart design methodology, in what follows, the focus
is on the application-architecture modeling process for embedded
multimedia systems. The big picture of a formal methodology for
performance estimation is presented in Figure 9.1. Following the principle
of orthogonalization of concerns during the design process, separate SAN
models are built for both applications and architectures. Next, the abstract
model of an application is mapped onto a family of architectures (platform)
and the power-performance figures are evaluated to see how suited is the
platform (and the chosen set of design parameters) for the target application.
This process can be re-iterated with a different set of parameters until
convergence to the desired result.
This global vision has several unique features: First, the methodology is
based on integrating the power-performance metrics into the system-level
design. Indeed, the performance metrics that are being developed become an
integral part of the design process; this helps systems designers to quickly
find the right architecture for the target application. Second, using the same
The SAN Modeling Paradigm 245

unique representation based on SANs, for both application and architecture,


gives the ability to smoothly translate a performance model into an
architecture (obtain real numbers on performance) and reflect architectural
changes back to the performance model.
Using SANs as an effective formalism in system-level analysis is very
useful since embedded applications are highly concurrent and, consequently,
do not easily fit the traditional model of sequential control flow. Another
major advantage of SANs over other formalisms is that the state space
explosion problem associated with the Markov models (or Petri nets) is
partially mitigated by the fact that the state transition matrix is not stored nor
even generated [20][21].
The models built for applications are process-level functional models that
are free of any architectural details. These processes communicate and
interact among themselves, defining what the application should do and not
how it will be implemented. On the other hand, the architecture models
represent behavioral descriptions of the architectural building blocks.
Typically, these building blocks may consist of several programmable cores
or dedicated hardware units (computation resources), communication
resources (buses), and memory resources (RAMs, FIFO buffers). A
separation of concerns between application and architecture enables the
reuse of both application and architecture models and facilitates an
explorative design process in which application models are subsequently
mapped onto architecture models.
Once built, the application-architecture model is evaluated to determine
the characteristics of the processes for different input parameters. While
model evaluation is a challenging problem by itself, analytical performance
model evaluation presents additional challenges. For this, this chapter
presents a fully analytical framework using SANs, which helps to avoid
lengthy profiling simulations for predicting power and performance figures.
This is important for multimedia systems where many runs are typically
required to gather relevant statistics for average-case behavior. Considering
that 5 min. of compressed MPEG-2 video needs roughly 1.2 Gbits of input
vectors to simulate, the impact of having such a tool to evaluate
power/performance estimates becomes evident.

9.3 THE SAN MODELING PARADIGM

SANs present a modular state-transition representation for highly concurrent


systems. The main objective of SAN analysis is the computation of the
stationary probability distribution for an N-dimensional system consisting
of N stochastic automata that operate more or less independently. This
246 System and Microarchitectural Level Power Modeling, etc.

involves two major steps: 1) SAN model construction and 2) SAN model
evaluation. The following sections briefly describe these two steps.

9.3.1 The SAN Model Construction

The SAN model can be described using continuous-time Markov processes


that are based on infinitesimal generators8 defined as:

with
and where is the transition probability

(directly or indirectly) from state i to state j during time 0 to t, and


represents its derivative. Each entry in the infinitesimal generator is in
fact the execution rate of the process in that particular state.

An N-dimensional SAN consists of N stochastic automata that operate


more or less independently of each other. The number of states in the
automaton is denoted by k = 1, 2,..., N. The main objective of the SAN
analysis is the computation of the stationary probability distribution of the
overall N-dimensional system.
To solve the N-dimensional system that is formed from independent
stochastic automata, it suffices first to solve the probability distributions of
each individual stochastic automata and then form the tensor product of
these distributions. Although such systems may exist, for embedded
applications, interactions among processes must be considered. There are
two ways in which the stochastic automata can interact:

8
This generator is the analogue of the transition probability matrix in the discrete-time
Markov chains.
The SAN Modeling Paradigm 247

A transition in one automaton forces a transition to occur in one


or more other automata. These are called synchronizing transitions
(events). Synchronizing events affect the global system by altering the
state of possibly many automata. In any given automaton, transitions
that are not synchronized are said to be local transitions.
The rate at which a transition may occur in one automaton is a
function of the state of another automata. These are functional tran-
sitions, as opposed to constant-rate (non-functional) transitions.
Functional transitions affect the global system only by changing the
state of a single automaton.
The effect of synchronizing events
Let (k = 1, 2,..., N) be the matrix consisting only of local transitions of
any automaton k. Then, the part of the global infinitesimal generator that
consists uniquely of local transitions can be obtained by forming the tensor
sum of matrices It has been shown in [20], that SANs
can be always treated by separating out the local transitions, handling them
in the usual fashion by means of a tensor sum and then incorporating the
sum of two additional tensor products per synchronizing event. More
importantly, since tensor sums are defined in terms of the (usual) matrix sum
(of N terms) of tensor products, the infinitesimal generator of a system
consisting of N stochastic automata with E synchronizing events (and no
functional transition rates) can be written as:

This quantity is referred to as the global descriptor of the SAN. It should


be noted that, even if the descriptor can be written as a sum of tensor
products, the solution is not simply the sum of the tensor products of the
vector solutions of individual This directly results from the fact that the
automata that we are considering are not independent.
The effect of functional transition rates
Introducing functional transition rates has no effect on the structure of the
global transition rate matrix other than, when functions evaluate to zero, a
degenerate form of the original structure is obtained. So, although the effect
of the dependent interactions among the individual automata prevents one
from writing the solution as a tensor product of individual solutions, it is still
possible to take advantage of the fact that the nonzero structure is
unchanged. This is the motivation behind the generalized tensor product
248 System and Microarchitectural Level Power Modeling, etc.

[21]. The descriptor is still written as in equation (9.2), but now its elements
can be functions. In this case, each tensor product that incorporates matrices
with functional entries is replaced with a sum of tensor products of matrices
that incorporate only average numerical entries. Then equation (9.2)

becomes where contains only numerical values, and the

size of T depends on 2E + N and on where F is the set of automata

whose state variables are arguments in functional transition rates. Although


T may be large, it is bounded by

9.3.2 Performance Model Evaluation

Once the SAN model is obtained, one needs to calculate its steady-state
solution. This is simply expressed by the solution of the equation

with the normalization condition where is steady-state probability


distribution, and Equation (9.3) can be solved using numerical
methods that do not require the explicit construction of the matrix but can
work with the descriptor in its compact form. For instance, the power
method, which is applied to the discretized version of can be used. The
iterative process therefore becomes

where As can be seen here, the operation that needs to

be computed very efficiently is Exploiting the properties of the


tensorial product (which is unique to the SAN model!), this can be done
using only multiplications, where is the numbers of states in
Case Study: Power-performance of The MPEG-2 Video Decoder 249

the i-th automaton [20]. Note that this is far better than the brute-force

approach that would require multiplications.

Once the steady-state distribution is known, performance measures


such as throughput, utilization, and average response time can be easily
derived. However, in order to calculate such performance figures, one needs
to find the true rates of the activities, which in turn require calculating the
probability that each activity is enabled. This is because the specified rate of
an activity is not necessarily the same as the rate of that activity in the
equilibrium state since bottlenecks elsewhere in the system may slow the
activity down. The true (or equilibrium) rate of an activity is thus specified
by the rate multiplied by the probability that the activity is enabled.

9.4 CASE STUDY: POWER-PERFORMANCE OF THE


MPEG-2 VIDEO DECODER APPLICATION

9.4.1 System Specification

As shown in Figure 9.2, the decoder consists of the baseline unit, the Motion
Compensation (MV) and recovery units, and the associated buffers [22]. The
baseline unit contains the VLD (Variable Length Decoder), IQ/IZZ (Inverse
Quantization and Inverse Zigzag), and IDCT (Inverse Discrete Cosine
Transform) units, as well as the buffers. During the modeling process, each
250 System and Microarchitectural Level Power Modeling, etc.

of these units are modeled as processes, the corresponding SANs are


generated.
To specify the system, the Stateflow component of Matlab, which uses
the semantics of Statecharts [23], is employed. Statecharts extend the
conventional state diagrams with the notion of hierarchy, concurrency, and
communication. This is important since we aim to analyze how the
asynchronous nature of concurrent systems can affect their run-time
behavior.

9.4.2 Application Modeling

To model the application of interest, a process graph is used, where each


component corresponds to a process in the application. Communication
between processes can be achieved using various protocols, including simple
ones using event and wait synchronization signals. Process graphs are also
characterized by execution rates that, under the hypothesis of exponential
distribution of the sojourn times, can be used to generate the underlying
Markov chain [21]. In the SAN-based modeling strategy, each automaton
corresponds to a process in the application. Hence, the whole process graph
specifying the embedded system translates to a network of automata.
The entire process graph that corresponds to MPEG-2 is modeled
following the Producer-Consumer paradigm. To unravel the complete con-
currency of processes that describe the application, it is assumed that each
process has its own space to run and does not compete for any computing
resource. For the sake of simplicity, the SAN model is presented in Figure
9.3 for the baseline unit.
Referring to the Producer process (VLD), one can observe a local transition
between produce(item) and wait_buffer states; that is, this transition occurs
at the fixed rate of where is the required time to produce one
item. The transition from the state wait_buffer to the state write is a
functional transition because it depends on the state of the other process.
More precisely, this transition happens if and only if the process IDCT is not
reading any data and the buffer is not full. Because of this dependency, one
cannot associate a fixed rate to this transition; the actual rate will depend
on the overall behavior of the system. Finally, once the producer gets access
to the buffer, it transitions to the initial state (the local transition rate is
The same considerations apply to the Consumer process (IDCT/IQ).
Once built, the model of the application is evaluated using the analytical
procedure in Section 3 based on exploiting the SAN analysis technique. To
this end, one needs to construct the infinitesimal generator matrices
corresponding to each automaton from the Stateflow diagrams. This involves
deriving the matrices corresponding to synchronization and functional
Case Study: Power-performance of The MPEG-2 Video Decoder 251

transitions, apart from those corresponding to local transitions. These


transition matrices incorporate rate information, provided by the designer
from trace-driven simulations of the application.

9.4.3 Platform Modeling

This modeling step starts with an abstract specification of the platform (e.g.,
Stateflow) and produces a SAN model that reflects the behavior of that
particular specification. A library of generic blocks that can be combined in
a bottom-up fashion to model sophisticated behaviors is constructed.
The generic building blocks model different types of resources in an
architecture, such as processors, communication resources, and memory
resources. Defining a complex architecture thus becomes as easy as
instantiating building blocks from a library and interconnecting them.
Compared to the laborious work of writing fully functional architecture
models (in Verilog/VHDL), this can save the designer a significant amount
of time and therefore enable exploration of alternative architectures.
Architecture modeling shares many ideas with the application modeling
that was just discussed. In Figure 9.4 a few simple generic building blocks
are illustrated. In Figure 9.4(a), the generic model of a CPU is represented
based on a power-saving architecture. Normal-mode is the normal operating
mode when every on-chip resource is functional. StopClock mode offers the
greatest power savings and, consequently, the least functionality. Finally,
Figure 9.4(b) describes a typical memory model.
252 System and Microarchitectural Level Power Modeling, etc.

9.4.4 Mapping

After the application and the architecture models are generated, the next step
is to map the application onto architecture and then evaluate the model using
the analytical procedure in Section 3. To make the discussion more specific,
let us consider the following design problem.
The design problem
Assume that we have to decide how to configure a platform that can work in
four different ways: one that has three identical CPUs operating at clock
frequency (then each process can run on its own processor) and another
three architectures where we can use only one physical CPU but have the
freedom of choosing the speed among the values
Mapping the simple VLD-IDCT/IQ processes in Figure 9.3 onto a
platform with a single CPU is illustrated in Figure 9.5. Because the two
processes9 now have to share the same CPU, some of the local transitions
become synchronizing/functional transitions (e.g., the local transitions with
9
For simplicity, the second consumer process (for the MV unit) was not explicitly represented
in this figure.
Results and Discussion 253

rates or become synchronized). Moreover, some new


states (e.g., wait_CPU) have to be introduced to model the new
synchronization relationship.
To complete the mapping, another process is needed, namely the
scheduler. This process determines the sequence in which the various con-
current processes of the application run on the different architectural
components, particularly if the resource is shared. This process can be
implemented in software or hardware but since the SAN representation is
uncommitted, this new component can be added easily to the entire network
of automata. This completes all the steps of the modeling methodology.

9.5 RESULTS AND DISCUSSION

For all the experiments, the following parameters were used: five slot buffers
(that is, n = 5, where one entry in the buffer represents one block of the 64
DCT coefficients that are needed for one IDCT operation),

and
254 System and Microarchitectural Level Power Modeling, etc.

9.5.1 Performance Results

For a platform with three separate CPUs, the analysis is quite simple: the
system will essentially run in the CPU-active state or will either be waiting
for the buffer or writing into the buffer most of the time. The average length
values are 1.57 and 0.53 for the MV and baseline unit buffers, respectively.
This is in sharp contrast with the worst-case scenario assumption where the
lengths will be 4 across all runs.
For a platform with a single CPU, the probability distribution values for
all the components of the system are given in Figure 9.6. The first column in
these diagrams shows the probability of the processes waiting for their
respective packets to arrive. The second column shows the probability of the
process waiting for the CPU to become available (because the CPU is shared
Results and Discussion 255

among all three processes). The third column represents the probability of
the processes actively using the CPU to accomplish their specific tasks. The
fourth column shows the probability of the process being blocked because
the buffer is not available (either it is full/empty or being used by another
process). The fifth column shows the probability of the processes writing
into their corresponding buffers.
Run 1 represents the “reference” case where the CPU operates at its nominal frequency, while the second and the third runs represent cases with scaled CPU speeds. For instance, in run 1, the Producer (VLD)
is waiting with probability 0.01 to get its packets, waiting for CPU access
with probability 0.3, decoding with probability 0.4, waiting for the buffer
with probability 0.27, and finally writing the data into the buffer with
probability 0.02.
Looking at the probability distribution values of MV and baseline unit
buffers10 (Figure 9.7), one can see that a bottleneck may appear because of
the MV buffer. More precisely, the system is overloaded at run 1, balanced at run 2, and under-utilized at run 3. The average buffer lengths in runs 1, 2, and
3 are:
MV buffer: 3.14, 1.52, and 1.15
baseline unit buffer: 0.81, 0.63, and 0.54
respectively. Since the average length of the buffers is proportional to the
average waiting time (and therefore directly impacts the system
performance), one can see that, based solely on the performance figures, the best choice would be a single CPU running at the run-3 speed. Also, notice how different the average values (e.g., 1.15 and 0.54, respectively) are compared to the value of 4

10
The columns in the Buffer Diagrams show the distribution of the buffer occupancy ranging from 0
(empty) to 4 (full).

provided by a worst-case analysis. Not only is this worst-case length about 4 times larger than the average one, but it also occurs less than 6% of the
time. Consequently, designing the system based on worst-case analysis will
result in completely inefficient implementations.
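For concreteness, the average lengths quoted above are simply the means of the buffer-occupancy distributions shown in the buffer diagrams. A minimal sketch, using a made-up distribution (chosen only to reproduce the run-3 MV average of 1.15, not taken from the chapter's data):

```python
# Expected buffer occupancy: sum of k * P(occupancy = k) over the five
# possible fill levels (0 = empty .. 4 = full). The distribution below is
# purely illustrative.

def average_length(occupancy_probs):
    return sum(k * p for k, p in enumerate(occupancy_probs))

mv_run3 = [0.40, 0.25, 0.20, 0.10, 0.05]   # hypothetical P(occ = 0..4)
assert abs(sum(mv_run3) - 1.0) < 1e-9       # must be a probability distribution
print(round(average_length(mv_run3), 2))    # → 1.15
```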

9.5.2 Power Results

The average system-level power can be obtained by summing up all the subsystem-level power values. For any subsystem k, the average power consumed is given by:

$P_k = \sum_i \pi_i P_i + \sum_i \sum_{j \ne i} \pi_i \, r_{ij} \, P_{ij}$          (9.5)

where $P_i$ and $P_{ij}$ represent the power consumption per state and per transition, respectively, $\pi_i$ is the steady-state probability of state i, and $r_{ij}$ is the transition rate associated with the transitions between states i and j. Having already determined the solution of equation (9.3), the $\pi_i$ value (for a particular i) can be found by summing up the appropriate components of the global probability vector $\pi$. The $P_i$ and $P_{ij}$ costs are determined during an off-line pre-characterization step where other proposed techniques can be successfully applied [24][25].
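A direct implementation of this computation over a small, invented two-state example (all probabilities, rates, and per-state/per-transition power costs are illustrative, not the chapter's values):

```python
# Average subsystem power as in equation (9.5): steady-state probability
# weighted state power plus rate-weighted transition power.

def average_power(pi, state_power, rate, trans_power):
    """P_k = sum_i pi_i * P_i + sum_{i != j} pi_i * r_ij * P_ij."""
    n = len(pi)
    total = sum(pi[i] * state_power[i] for i in range(n))
    for i in range(n):
        for j in range(n):
            if i != j:
                total += pi[i] * rate[i][j] * trans_power[i][j]
    return total

pi = [0.6, 0.4]                          # steady-state probabilities
state_power = [1.0, 3.0]                 # W consumed in each state
rate = [[0.0, 2.0], [1.0, 0.0]]          # transitions per second
trans_power = [[0.0, 0.5], [0.2, 0.0]]   # W per transition

print(round(average_power(pi, state_power, rate, trans_power), 2))  # → 2.48
```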

To obtain the power values, the MPEG-2 decoder was simulated, using
the Wattch [26] architectural simulator that estimates the CPU power
consumption, based on a suite of parameterized power models. More
precisely, the simulation of the MPEG-2 was monitored and the power
values for all components of the system were extracted. By specifying a StrongARM-like processor, an average power value of 4.6W for the VLD,
4.8W for the IDCT, and 5.1W for the MV unit was obtained. Using these power figures, the average power characterization was obtained for the entire
system under varying loads. This is useful to trade-off performance and
power. In this case, using equation (9.5), the (total) average power values of
18.75W, 13.68W, 15.08W for run 1, 2, and 3, respectively, were obtained.
For a more detailed analysis, the breakdown of power-consumption is
given in Figure 9.8. It can be seen that there is a large variation among the
three runs with respect to both the CPU-active power and the power
dissipation of the buffers. Furthermore, these power values can be multiplied
with the average buffer lengths from Figure 9.7 (3.95, 2.15, and 1.69, for
runs 1, 2, and 3, respectively), and the power×delay characterization of the
system can be obtained; that is, 58.9, 20.8, and 17.3 (about 70% less) for runs 1, 2, and 3, respectively.
This analysis shows that the best choice would be to use the third configuration (i.e., a single CPU running at the run-3 speed) since, for the given set of parameters, it represents the best application-architecture combination. (This choice is also far better than using three separate CPUs.)
Finally, the CPU time needed for this analysis is at least several orders of
magnitude better than the active simulation time required to obtain the same
results with detailed simulation. Hence, the approach can significantly cut
down the design cycle time and, at the same time, enhance the opportunities
for better design space exploration.

9.6 MICROARCHITECTURE-LEVEL POWER


MODELING

To characterize the quality (in terms of power and performance) of different


microarchitectural configurations, we need to rely on a few metrics of
interest. As pointed out in [27], when characterizing the performance of
modern processors, the CPI (Cycles per Instruction) or IPC (Instructions per Cycle, 1/CPI) is only one of two parameters that need to be considered, the second one being the actual cycle time. Thus, the CPI × cycle-time product is a more accurate measure for characterizing the performance of modern
processors. In the case of power consumption, most researchers have
concentrated on estimating or optimizing energy per committed instruction
(EPI) or energy per cycle (EPC) [26][28][29]. While in the case of
embedded computer systems with tight power budgets some performance
may be sacrificed for lowering the power consumption, in the case of high-
performance processors this is not desirable, and solutions that jointly
address the problem of low power and high performance are needed. To this
end, the energy delay product per committed instruction (EDPPI), defined as EPI × (CPI × cycle time), has been proposed as a measure that characterizes both the
performance and power efficiency of a given architecture. Such a measure
can identify microarchitectural configurations that keep the power
consumption to a minimum without significantly affecting the performance.
In addition to classical metrics (such as EPC and EPI), this measure can be
used to assess the efficiency of different power-optimization techniques and
to compare different configurations as far as power consumption is
concerned.
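These metrics can be illustrated with a small calculation; all raw counts and the energy figure below are invented:

```python
# CPI x cycle-time for performance, EPI/EPC for power, EDPPI for joint
# power-performance efficiency. Inputs are illustrative only.

cycles = 2_000_000        # total execution cycles
instructions = 1_600_000  # committed instructions
cycle_time_ns = 2.0       # clock period (ns)
energy_j = 0.06           # total energy (J)

cpi = cycles / instructions             # cycles per committed instruction
exec_time_ns = cpi * cycle_time_ns      # the CPI x cycle-time product (ns/instr)
epi = energy_j / instructions           # energy per committed instruction (J)
epc = energy_j / cycles                 # energy per cycle (J)
edppi = epi * exec_time_ns * 1e-9       # energy-delay product per instruction (J*s)

print(cpi, exec_time_ns)  # → 1.25 2.5
```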
Most microarchitectural-level power modeling tools for high-
performance processors consider a typical superscalar, out-of-order
configuration, based on the reservation station model (Figure 9.1). This
structure is used in modern processors like Pentium Pro and PowerPC 604.
The main difference between this structure and the one used in other
processors (e.g., MIPS R10000, DEC Alpha 21264, HP PA-8000) is that the
reorder buffer holds speculative values and the register file holds only
committed, non-speculative data, whereas for the second case, both
speculative and non-speculative values are in the register file. However, the
wake-up, select, and bypass logic are common to both types of architectures,
and, as pointed out in [27], their complexity increases significantly with
increasing issue widths and window sizes.
As expected, there is an intrinsic interdependency between processor
complexity and performance, power consumption and power density. It has
been noted [30] that increasing issue widths must go hand in hand with
increasing instruction window sizes to provide significant performance
gains. In addition, it has been shown that the complexity [31] (and thus,
power requirements) of today’s processors has to be characterized in terms
of issue width (that is, number of instructions fetched, dispatched, and
executed in parallel), instruction window size (that is, the window of
instructions that are dynamically reordered and scheduled for achieving
higher parallelism), and pipeline depth, which is directly related to the
operating clock frequency.
One of the most widely used microarchitectural power simulators for
superscalar, out-of-order processors is Wattch [26], which has been
developed using the infrastructure offered by SimpleScalar [32].
SimpleScalar performs fast, flexible, and accurate simulation of modern
processors that implement a derivative of the MIPS-IV architecture [33] and
support superscalar, out-of-order execution, which is typical for today’s
high-end processors. The power estimation engine of Wattch is based on the
SimpleScalar architecture, but in addition, it supports detailed cycle-accurate
information for all modules, including datapath elements, memory and CAM
(Content-Addressable Memory) arrays, control logic, and clock distribution


network. Wattch uses activity-driven, parameterizable power models, and it
has been shown to be within 10% accurate when compared against three
different architectures [26].

Note, however, that of equal importance is the use of data dependent


models for datapath modules (ALUs, multipliers, etc.). As shown in [30],
Wattch concentrates on accurately modeling the memory arrays using
capacitance models very similar to the previously proposed Cacti tools [27][34][35], but it can be enhanced by using cycle-accurate models for datapath modules like integer and floating-point ALUs and multipliers. Also,
of equal importance is the use of parameterizable models for global clock
power as a function of pipeline depth and configuration.
For accurate estimates, the power models used for the datapath
modules can be based on input-dependent macromodels [36]. The input
statistics are gathered by the underlying detailed simulation engine and used,
together with technology-specific load capacitance values, to obtain power-
consumption values. Assuming a combination of static and dynamic CMOS
implementations, one can use a cycle-accurate power macromodeling
approach for each of the units of interest [36]:

$P_k = P(V_{k-1} \rightarrow V_k)$

where $P_k$ is the power consumption of a given module during cycle k, when input vector $V_{k-1}$ is followed by input vector $V_k$.
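A table-based macromodel of this kind can be sketched as follows; the input vectors, table entries, and default cost are all invented for illustration and do not correspond to any characterized module:

```python
# Sketch of an input-dependent power macromodel: the cost of a module in
# cycle k is looked up from the input transition (v_{k-1} -> v_k). Entries
# would normally come from an off-line characterization step.

macromodel = {
    (0b00, 0b11): 4.0,   # high switching activity -> high cost (arbitrary units)
    (0b11, 0b11): 0.5,   # no input change -> mostly idle/leakage cost
    (0b00, 0b01): 2.0,
}

def cycle_power(prev_vec, cur_vec, default=1.0):
    """Return the module's per-cycle cost for the (v_{k-1}, v_k) pair."""
    return macromodel.get((prev_vec, cur_vec), default)

def trace_energy(vectors):
    """Accumulate the cost over a sequence of input vectors."""
    return sum(cycle_power(a, b) for a, b in zip(vectors, vectors[1:]))

print(trace_energy([0b00, 0b11, 0b11, 0b01]))  # → 5.5
```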

While estimation accuracy is important for all modules inside the core
processor, it is recognized that up to 40-45% of the power budget goes into
the global clock power [37]. Thus, accurate estimation of the global clock
power is particularly important for evaluating power values of different core
processor configurations. Specifically, the global clock power can be estimated as a function of the die size and number of pipeline registers [30][38]:

$P_{clock} = V_{dd}^2 \, f \, ( N_{reg} C_{reg} + C_{wire} L_{global} + \alpha \, C_{wire} L_{local} )$

where the first term accounts for the register load and the second and third terms account for the global and local wiring capacitance ($\alpha$ is a term that depends on the local routing algorithm used, and h is the depth of the H-tree, which determines the global wirelength $L_{global}$). $C_{reg}$ is the nominal input capacitance seen at each clocked register, $C_{wire}$ is the wire capacitance per unit length, and $N_{reg}$ is the number of pipeline registers, for p pipeline stages.
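As a rough illustration of a clock-power estimate of this shape (register load plus global and local wiring terms), the following sketch uses invented capacitance and wirelength values; none of the constants come from the cited models:

```python
# Clock power = total switched capacitance x Vdd^2 x f, where the
# capacitance is split into register load, global (H-tree) wiring, and
# local wiring. All numbers are placeholders.

def clock_power(vdd, freq, n_reg, c_reg, c_wire, global_len, local_len, alpha):
    c_total = n_reg * c_reg + c_wire * global_len + alpha * c_wire * local_len
    return c_total * vdd ** 2 * freq

# e.g., 20k clocked register bits at 2 fF each, 0.2 fF/um wire capacitance
p = clock_power(vdd=1.2, freq=1e9, n_reg=20_000, c_reg=2e-15,
                c_wire=2e-16, global_len=5e4, local_len=2e5, alpha=0.5)
print(round(p, 4))  # watts → 0.1008
```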
To estimate the die size and number of clocked pipeline registers, the
microarchitectural configuration can be used as follows:

$A_{die} = A_{arrays} + A_{FU} + A_{clock}$

where memory and CAM (Content-Addressable Memory) arrays account for caches, TLBs, branch prediction table, rename logic, and instruction
window; functional units are the integer and floating point units; and clock
refers to the clock distribution tree and clocked pipeline registers. To
estimate the size of each module, we rely on the wirelength and module size
calculation done in Cacti, which is at the basis of latency estimation.
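The area bookkeeping implied by this breakdown can be sketched as a simple sum over per-module estimates; the module list mirrors the text, but every number below is a placeholder rather than an actual Cacti result:

```python
# Core die area as the sum of array, functional-unit, and clock-related
# module areas (placeholder values, in mm^2).

areas = {
    "icache": 6.0, "dcache": 8.0, "tlb": 1.0, "bpred": 1.5,
    "rename": 0.8, "window": 2.2,          # memory / CAM arrays
    "int_alu": 3.0, "fp_alu": 4.5,         # functional units
    "clock_tree": 1.2, "pipe_regs": 0.8,   # clock distribution + registers
}

die_area = sum(areas.values())
print(round(die_area, 1))  # → 29.0
```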
A useful set of estimation tools for the area and latency of different memory arrays (I-cache, D-cache, TLB, branch prediction table, instruction issue window, etc.) is the Cacti tool suite, which provides accurate models (within 5-7% error when compared to HSPICE) for estimating load capacitances
based on realistic implementations of different memory arrays and CAM
structures. In addition, Cacti relies on load calculation based on RC models
using wirelength estimation and appropriate scaling among different
technologies. Similar models are used for power modeling array and CAM
structures in Wattch. For a complete analysis of wirelength, module size, or
latency, refer to [27][34][35][38][39].

9.7 EFFICIENT PROCESSOR DESIGN


EXPLORATION FOR LOW POWER

Today’s superscalar, out-of-order processors pack a lot of complexity and


functionality on the same die. Hence, design exploration to find high-performance or power-efficient configurations is not an easy task. As shown
previously [40][41][42][43][44][30], some of the factors that have a major
impact on the power/performance of a given processor are issue width, cache
configuration, etc. However, as shown in [26], the issue window strongly
impacts the power cost of a typical superscalar, out-of-order processor. As
shown in [30], the issue width (and corresponding number of functional
units), instruction window size, as well as the pipeline depth have the largest
impact as parameters in a design exploration environment.

A possible design exploration environment follows the flow in Figure


9.10. At the heart of the exploration framework is a fast microarchitectural
simulator (estimate_metrics) that provides sufficiently accurate estimates
for the metric of interest. Depending on the designer’s needs, this metric can
be one of CPI, EPI, or EDPPI, depending on whether a high-
performance or a joint high-performance and energy-efficient organization is
sought. As shown in Figure 9.10, the exploration is performed for a set of
benchmarks B, a set of possible issue widths I, instruction window sizes W,
and a number of possible voltage levels N. For each pair (issue width,
instruction window size), the stage latencies are estimated. If a balanced
pipelined design is sought, the pipeline is further refined to account for this,
and only one voltage level is assumed for the entire design. Otherwise,
depending on the latencies of the different stages, up to N different voltages
are assigned to different modules such that performance constraints are


maintained, and the slowest stage dictates the operating clock frequency.

9.7.1 Efficient Microarchitectural Power Simulation

For a design exploration environment to be able to explore many possible


design configurations in a short period of time, it has to rely either on a
smart methodology to prune the design space or on a fast, yet sufficiently
accurate estimation tool for the metrics of interest. One of the approaches
relying on the latter type of technique has been presented in [30].
The crux of the estimation speed-up is a two-level simulation methodology: for critical parts of the code, an accurate, lower-
level (but slow) simulation engine is invoked, whereas for non-critical parts
of the application program, a fast, high-level, but less accurate simulation is
performed. Following the principle “make the common case accurate,” ideal
candidates for critical sections that should be modeled accurately are those
pieces of code in which the application spends a lot of time, which have been
called hotspots [45].
Example: Consider the collection of basic blocks11 in Figure 9.11, where
edges correspond to conditional branches and the weight of each edge is
proportional to the number of times that direction of the branch is visited.

Hotspots are collections of basic blocks that communicate closely with one another but are unlikely to transition to a basic block outside of that collection. In Figure 9.11, basic blocks 1-4 and 5-9 are part of two different hotspots that communicate infrequently with one another. As will be seen
later, these hotspots satisfy nice locality properties not only temporally, but

11
A basic block is a straight-line piece of code ending at any branch or jump instruction.
also in terms of the behavior of the metrics that characterize power


efficiency and performance. Temporal locality as well as the high probability
of reusing internal variables [46] make hotspots attractive candidates for
sampling metrics of interest over a fixed sampling window after a warm-up
period that would take care of any transient regimes. Estimated metrics
obtained via sampling can be reused when the exact same code is run again.
Although different, such an approach is similar in some ways to power-
estimation techniques for hardware IPs using hierarchical sequence
compaction [47] or stratified random sampling [48]. In addition, the relative
sequencing of basic blocks is preserved (as in [49]), and the use of a warm-
up period ensures that overlapping of traces [50] is not necessary. This is in
contrast with synthetically constructing traces for evaluating performance
[51] and power consumption [52].
To speed up the simulation time inside the hotspots and achieve the goal of “making the common case fast,” sampling of power and performance metrics can be used until a given level of accuracy is achieved.
This is supported by the fact that while being in a hotspot, both power
consumption (EPC) and performance (IPC) achieve their stationary values
within a short period of time, relative to the dynamic duration of the hotspot.
As experimental evidence shows [30], the steady-state behavior is achieved
in less than 5% of the hotspot dynamic duration, thus providing significant
opportunities for simulation speed-up, with minimal accuracy loss.

Figure 9.12 shows how the two-level simulation engine is organized.


During detailed simulation, all performance and related power metrics are
collected for cycle-accurate modeling. When a hotspot is detected, detailed
analysis is continued for the entire duration of the sampling period. When
sampling is done, the simulator is switched to basic profiling that only keeps
track of the control flow of the application. Whenever the code exits the
hotspot, detailed simulation is started again. This way, the error of
estimation is conservatively bounded by the sampling error within the


hotspots. Performing detailed simulation outside the hotspots ensures that
the estimates are still accurate for benchmarks with low temporal locality
(e.g., less than 60% time spent in hotspots).
The two-level simulation engine
To complete the two-level simulation engine, we need a reliable and
sufficiently detailed (albeit slow) power/performance simulator as well as a
rough (fast) profiler to keep track of where we are in the code. A possible
choice for the detailed simulation engine is the Wattch power simulator [26]
or a modified version with similar models for memory arrays and caches and
enhanced features for other modules:
Cycle-accurate power estimation of datapath modules like integer or
floating-point ALUs and multipliers.
Parameterizable clock power modeling as a function of the pipeline
depth and number of pipeline registers that need to be clocked.
An alternative approach is detailed microarchitectural power modeling
using labeled simulation [53] or other types of cycle-accurate power
simulation engines [31][40].
As [30] pointed out, the two-level simulation approach is completely
general and applicable to any detailed power/performance simulation engine
as long as it is augmented with a hotspot detection mechanism for speed-up
purposes.
Identifying hotspots
As described previously, collections of basic blocks executing together very
frequently are called hotspots. It has been shown that most of the execution
time of a program is spent in several small critical regions of code, or in
several hotspots. These hotspots consist of a number of basic blocks
exhibiting strong temporal locality. In [45] a scheme for detecting hotspots
at run-time has been presented. As opposed to previous approaches that
implement the hotspot detection and monitoring mechanism in hardware,
the mechanisms can be implemented within the simulator itself. The main
advantage is that the overhead introduced in simulation time is negligible.
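A much-simplified software version of such a detection mechanism is sketched below; the thresholds, window size, and candidate-branch logic are invented for illustration, and the actual scheme in [45] is more elaborate:

```python
# Per-branch execution counters mark frequently executed ("candidate")
# branches; a hotspot is declared when most recently executed branches
# are candidates. All thresholds are made up.

from collections import defaultdict

CANDIDATE_THRESHOLD = 16   # executions before a branch becomes a candidate
WINDOW = 64                # sliding window of recent branch executions
HOTSPOT_FRACTION = 0.9     # fraction of window that must hit candidates

class HotspotDetector:
    def __init__(self):
        self.counts = defaultdict(int)
        self.recent = []          # 1 if branch was a candidate, else 0

    def on_branch(self, pc):
        self.counts[pc] += 1
        self.recent.append(1 if self.counts[pc] >= CANDIDATE_THRESHOLD else 0)
        if len(self.recent) > WINDOW:
            self.recent.pop(0)
        return (len(self.recent) == WINDOW and
                sum(self.recent) >= HOTSPOT_FRACTION * WINDOW)

det = HotspotDetector()
in_hotspot = False
for _ in range(100):              # a tight loop over two branches
    det.on_branch(0x400)
    in_hotspot = det.on_branch(0x440)
print(in_hotspot)  # → True
```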
Sampling hotspots
As mentioned before, the main mechanism for achieving speed-up in power
/performance simulation is the fast convergence of both IPC and EPC
metrics while code is running inside a hotspot. A possible sampling scheme
is shown in Figure 9.13 [30].
After a hotspot is detected, no sampling is done for a warm-up period
needed to bypass any transient regime due to compulsory cache misses, etc.
Then, for a number of cycles denoted by the sampling window size, metrics
of interest (committed instructions, access counts, cache misses, etc.) are


monitored and collected in lumped CPI, EPI, or EDPPI metrics that
characterize the entire hotspot. After the sampling period is over, the
simulator is switched to the fast profiling mode and then back to detailed
mode when the exit out of the hotspot is detected.
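The mode-switching loop described above can be sketched as follows; the window sizes and the trace format are illustrative, not those of any actual simulator:

```python
# Two-level simulation: detailed (cycle-accurate) mode outside hotspots and
# during warm-up + sampling, fast profiling mode for the rest of a hotspot.

WARMUP = 1000   # cycles skipped after hotspot entry (transient regimes)
SAMPLE = 4000   # sampling-window size in cycles

def simulate(trace):
    """trace: iterable of (cycle_stats, in_hotspot) pairs, one per cycle."""
    sampled, mode_cycles = [], {"detailed": 0, "profile": 0}
    hotspot_cycles = 0
    for stats, in_hotspot in trace:
        if not in_hotspot:
            hotspot_cycles = 0
            mode_cycles["detailed"] += 1   # full cycle-accurate accounting
            continue
        hotspot_cycles += 1
        if hotspot_cycles <= WARMUP:
            mode_cycles["detailed"] += 1   # warm up caches, predictors, ...
        elif hotspot_cycles <= WARMUP + SAMPLE:
            mode_cycles["detailed"] += 1
            sampled.append(stats)          # collect IPC/EPC samples
        else:
            mode_cycles["profile"] += 1    # fast: track control flow only
    return sampled, mode_cycles

trace = [({"ipc": 1.0}, True)] * (WARMUP + SAMPLE + 100)
samples, modes = simulate(trace)
print(len(samples), modes["profile"])  # → 4000 100
```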

9.7.2 Design Exploration Trade-offs

Using the two-level simulation engine, fast design exploration at the microarchitectural level can be performed orders of magnitude faster than with full-blown, detailed simulation. Assume a 4-way processor
configuration with an instruction window size of 32, a 32-entry register file,
a direct mapped I-cache of size 16K with a block size of 32B, and a 4-way
set associative D-cache of size 16K with 32B blocks. For analyzing such a microarchitectural configuration, a modified version of Wattch that accounts for data-dependent power values and parameterized clock power can be used as the detailed simulator. For the non-detailed profiling
simulation, only branch instructions need to be monitored for the purpose of
identifying if the application is still in or out of a hotspot. Results showing
simulation accuracy and relative speed-up for a 128K-cycle sampling
window are reported in Figure 9.14.
It can be seen that since most benchmarks spend most of the execution
time in hotspots, using sampling inside hotspots with a sampling window
size of 128K cycles provides between 3X and 17X improvement in
simulation time (including the overhead due to hotspot detection). In
addition, when compared to the original cycle accurate simulator, the two-
level simulation engine is within 3% accurate for EPC estimates and 3.5%
for IPC values. Thus because it is both accurate and fast, the two-level
approach is an ideal candidate for a design exploration framework.
The set of microarchitectural configurations to be explored is given by (IW, WS) ∈ {(2,16), (4,16), (4,32), (8,32), (8,64), (8,128)}. After the latency analysis step is completed, accesses are further pipelined if needed. As
shown in [27][34][35][38][39], the first candidates for further pipelining are
I-cache and D-cache accesses, as well as the execution stage and register
file.
In all cases, the clock rate is dictated by a different module (D-cache or
wake-up & select logic). For the benchmarks reported in Figure 9.15, the
best IPC is obtained when a wider IW and/or a large WS is used. IPC
steadily increases when IW is increased, although in some cases (e.g., gcc),
the dependence on WS is more prevalent. However, in most cases, going
from a window size of 32 to 64 or 128 brings almost no improvement in
terms of performance, and it can actually reduce the performance due to a
slower clock rate dictated by a very slow wake-up/select logic (as is the case
of (8, 128) configuration). On the other hand, the average power
consumption (EPC) is usually minimized for lower values of IW and WS, but
this reduction comes at the price of decreased performance. In fact, the total
energy consumed during the execution of a given benchmark may actually
increase due to the increased idleness of different modules.

To characterize the total energy consumption, the energy per committed


instruction (EPI) is a more appropriate measure. While in some cases (ijpeg)
EPI decreases with higher IW and increases with higher values of WS, there
are cases where EPI decreases with increasing IW (gcc). However, in
general, the lowest EPI configuration is characterized by relatively low
values for IW and WS (4 and 16, respectively). If, however, the highest
energy reduction with lowest performance penalty is sought, in all cases but
one (gcc), the optimal configuration (i.e., lowest energy × delay product, EDPPI) is achieved for IW = 8 and WS = 32. Although the energy is not
minimized in these cases, the penalty in performance is less than in other
cases with similar energy savings.

Thus, in terms of energy efficiency, the best configuration is not


necessarily the one that achieves the highest IPC or performance. Depending
on the actual power budget, processor designers may choose to go with
lower-end configurations without much of a reduction in performance.

9.8 IMPLICATIONS OF APPLICATION PROFILE ON


ENERGY-AWARE COMPUTING

Most solutions to the power problem are static in nature, and they do not
allow for adaptation to the application. As described in the previous section,
there is wide variation in processor resource usage among various
applications. In addition, the execution profile of most applications indicates
that there is also wide variation in resource usage from one section of an
application’s code to another. For example, Figure 9.16 shows the execution
profile of the epic benchmark (part of the MediaBench suite) on a typical
workload on an 8-way issue processor. We can see several regions of code
execution characterized by high IPC values lasting for approximately two
million cycles each; towards the end we see regions of code with much
lower IPC values. The quantity and organization of the processor's resources
will also affect the overall execution profile and the energy consumption. As
seen before, low-end configurations consume higher energy per instruction
due to their inherently high CPI; high-end configurations also tend to have
high energies due in part to resource usage and in part to the power
consumption of unused modules. The ideal operating point is somewhere in
between.

Combining the above two ideas, the optimal operating point for each
region of code can be found in terms of processor resources. The goal is to
identify the correct configuration for each code region in terms of various
processor resources to optimize the overall energy consumption. Such an
approach allows fine-grained power management at the processor level
based on the characteristics of the running application. Hardware profiling


can be used in conjunction with hardware-based power estimation to identify
tightly coupled regions of code and optimize them on-the-fly for low energy
cost. Allocating architectural resources dynamically based upon the needs of
the running program, coupled with aggressive clock-gating styles, can lead
to significant power savings.

9.8.1 On-the-fly Energy Optimal Configuration Detection


and Optimization

To detect tightly coupled regions of code, one can resort to the hotspot
detection mechanism described in the previous section. Once a hotspot has
been detected, an optimum configuration for that hotspot needs to be
determined. A configuration is a unique combination of several processor parameters under control. As has been shown before, the size of the issue
window WS and the effective pipeline width IW are the two factors that most
dramatically affect the performance and energy consumption of the
processor. Hence, changing the configuration of the processor would mean
setting different values for WS and IW. The optimum is defined as that
configuration which leads to the least energy dissipated per committed
instruction. This is equivalent to the power-delay product per committed instruction (the inverse of MIPS per Watt), which is a metric used for characterizing the power-performance trade-off for a given processor.

9.8.2 Energy Profiling in Hardware

To determine the optimum configuration, there needs to be a way to


determine approximate energy dissipation statistics in hardware. A possible
solution has been presented in [54][55]: when a hotspot is detected, two counter registers, the power register and the instruction count register (ICR), are set in motion. The power register is used to maintain power statistics for the
most power-hungry units of the processor. Using the organization and modeling of Wattch [26] and collecting data on 14 benchmarks, one can identify the units of the processor that consume the most energy. Depending on
the implementation, these hottest units may vary from one implementation to
another, or the weights used for estimating power may be different. However, the same scheme can be implemented irrespective of the processor.
An alternative power-monitoring scheme for use in hardware has been
proposed in [56]. There, performance counters are used for power estimation
in hardware by approximating access counts to different hardware modules
with values stored by already existing performance counters.

9.8.3 On-the-fly Optimization of the Processor


Configuration

When a hotspot is detected, a finite state machine (FSM) walks the processor
through all possible configurations for a fixed number of instructions in each state of the FSM. The instruction count register (ICR) is used to
keep a count of the number of instructions retired by the processor, and it is
initialized with the number of instructions to be profiled in each
configuration. During each cycle, the ICR is decremented by the number of
instructions retired in that cycle. When ICR reaches zero, the power register
is sampled to obtain a figure proportional to the total energy dissipated. If
there were n parameters of the processor to vary, exhaustive testing of all
configurations would mean testing all points in the n-dimensional lattice for
a fixed number of instructions. If we use a set of configurations with four possible values of one parameter (e.g., WS) and three of the other (e.g., IW), we have a total of 4 × 3 = 12 configurations, requiring an FSM of only 12 states.
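A software sketch of this exhaustive walk follows; the parameter sets are hypothetical, and the fake energy model stands in for the power register / ICR measurement:

```python
# Walk all (IW, WS) configurations, "profiling" a fixed number of committed
# instructions per FSM state and keeping the configuration with the lowest
# energy per instruction. Energy numbers are invented.

import itertools

IW_VALUES = [2, 4, 8]            # hypothetical: 3 issue widths
WS_VALUES = [16, 32, 64, 128]    # hypothetical: 4 window sizes

def profile(config, n_instr=10_000):
    """Stand-in for sampling the power register after ICR reaches zero."""
    iw, ws = config
    return (0.5 * iw + 0.01 * ws) * n_instr   # fake energy (arbitrary units)

def best_configuration():
    configs = list(itertools.product(IW_VALUES, WS_VALUES))  # 3 x 4 = 12 states
    return min(configs, key=lambda c: profile(c) / 10_000)   # min energy/instr

print(best_configuration())  # → (2, 16)
```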

9.8.4 Selective Dynamic Voltage Scaling

By performing microarchitecture resource scaling, potential for additional


savings via DVS (Dynamic Voltage Scaling) can be uncovered. Buffered
lines in array structures can be used to selectively enable some parts of the
structure and disable others. Thus, scaling down the resources of a processor
can reduce the critical path delay since the rename and window access stages
(which determine the critical path to a large extent) have latencies highly
dependent on IW and WS. This can be exploited to dynamically scale the
operating voltage while keeping the clock frequency constant. Delays in
some structures scale better than others, and some delays do not scale at all.
Dynamic supply voltages could power the structures that scale well. This
would necessitate the use of level-shifters to pass data between different
stages operating at different voltages.
The dependence of path delay on supply voltage is given by the following equation [57]:

$D \propto \dfrac{V_{dd}}{(V_{dd} - V_t)^{\alpha}}$

where $\alpha$ is a technology-dependent factor (between 1 and 2). When the


processor goes from its highest configuration (WS of 64 and IW of 8) to the lowest (WS of 16 and IW of 4), the delay in the issue logic is reduced by about 60% for the technology considered. If the supply voltage was 5V to start with, scaling to the
lowest configuration now allows the issue logic to run at 3.6V. Assuming that energy dissipation is proportional to $V_{dd}^2$, the savings in energy dissipated in the issue logic amount to about 48%.
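The arithmetic behind this example can be checked directly. The threshold voltage and alpha in the delay model below are assumptions, but the 48% figure depends only on the 5V → 3.6V scaling:

```python
# Alpha-power delay law and quadratic energy dependence on Vdd.
# Vt = 0.8 V and a = 1.3 are assumed values, not from the chapter.

def delay_ratio(v_new, v_old, vt=0.8, a=1.3):
    """Relative path delay when Vdd changes, per D ~ Vdd / (Vdd - Vt)^a."""
    d = lambda v: v / (v - vt) ** a
    return d(v_new) / d(v_old)

def energy_savings(v_old, v_new):
    """Fractional energy saving assuming E ~ Vdd^2."""
    return 1.0 - (v_new / v_old) ** 2

print(delay_ratio(3.6, 5.0))               # path slowdown at the lower Vdd
print(round(energy_savings(5.0, 3.6), 3))  # → 0.482
```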

9.8.5 Effectiveness of Microarchitecture Resource Scaling

The power and energy savings for a typical 8-way processor with hotspot
detection and hardware power profiling are shown in Figures 9.17 and 9.18. In Figure 9.17, there are four values indicated for each application. The Dyn value represents the power obtained by doing dynamic microarchitecture resource scaling, assuming a 10% energy overhead for unused units, normalized to the power consumption of the base processor model. The
Ideal value represents the normalized power consumption, assuming no
overhead for unused units. While this is becoming increasingly difficult to
achieve with smaller technologies, this figure is given to provide an
indication of the potential savings possible with a circuit design aggressively
optimized for leakage. The figure marked Perf represents the power obtained
with the constraint that the performance hit does not exceed 12.5%. The
figure marked Dvs shows the power obtained with microarchitecture
resource scaling combined with dynamic voltage scaling applied to the
instruction window alone. For the benchmarks shown, an average power
saving of 18% and an average saving of 8% in total energy are achieved
compared to the base case. With dynamic voltage scaling, the
average power saving increases to 21% and the energy saving increases to
12%. Across the benchmarks shown, an average savings of 26% is obtained
in the instruction window energy (36% with dynamic voltage scaling).
The performance of the processor for various benchmarks using resource
scaling with and without the constraint on performance is shown in Figure
9.18. The average performance hit is lower for integer benchmarks than for
floating point benchmarks.
The characteristics of each application have to be taken into account
while interpreting the results. For example, the mpeg2 benchmark shows no
change in any parameter. This is because the entire execution time of the
mpeg decoder is spent inside one hotspot; the optimum configuration
determined for this hotspot is the same as the default configuration, so the
savings obtained is zero. The pegwit benchmark shows a large potential for
energy reduction with a corresponding trade-off in performance; it is in such
applications that the performance hit constraint is useful.

9.8.6 Comparison with Static Throttling Methods

Many power-management methods work by reducing the frequency of
operation of the chip at run-time. Such static clock throttling methods do not
reduce the net energy consumption for a particular task; they only serve to
spread out the consumption of the same amount of energy over a longer time
period.
The microarchitecture scaling approach is far better since for a given
penalty in performance (which can be restricted to acceptable levels) a net
savings in the total energy consumption is obtained. Other approaches have
been suggested that throttle the flow of instructions from the I-cache. Figure
9.19 shows a comparison between a microarchitecture scaling scheme and
static throttling methods, namely static I-cache throttling [58] with 2 and 4
instructions fetched per cycle and static clock throttling. It can be observed
that for given values of energy reduction achieved, microarchitecture
resource scaling provides significantly less delay than any of the static
throttling techniques.
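The contrast can be made concrete with a few lines of normalized arithmetic; the `task_energy` helper is illustrative, and the 18% power saving and 12.5% performance-hit cap are the average figures quoted above.

```python
# Back-of-the-envelope contrast between static clock throttling and
# microarchitecture resource scaling, using the average figures quoted
# in the text (18% power saving at a performance hit capped at 12.5%).

def task_energy(rel_power, rel_time):
    """Normalized task energy = normalized power x normalized run time."""
    return rel_power * rel_time

# Halving the clock roughly halves power but doubles execution time:
# the same energy is merely spread over a longer period.
throttled_energy = task_energy(rel_power=0.5, rel_time=2.0)
assert throttled_energy == 1.0

# Resource scaling trades a bounded performance hit for a real power
# drop, so net task energy goes down.
scaled_energy = task_energy(rel_power=1 - 0.18, rel_time=1.125)
print(f"resource-scaling task energy: {scaled_energy:.2f}")  # ~0.92 of base
```

The roughly 8% net energy saving that falls out of this product matches the average energy figure reported for resource scaling above.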

9.9 SUMMARY

With the increasing importance of power consumption as a first class design
constraint for today’s digital systems, be it high-end, high-performance, or
mobile/portable (and thus, battery limited) systems, solutions to address the
problem of power modeling and optimization at the highest levels of
abstraction are needed. At the system level, the problem of
power/performance analysis and design exploration has to be addressed in a
formal manner, by using techniques that target average case behavior, under
a variety of operating scenarios induced by the application profile and input
streams. Formal methods for platform-based design and analysis have
proven their viability by handling complex systems and by providing a
scalable and sufficiently accurate solution to fast platform selection and
exploration.
At the microarchitecture-level, accuracy is more important, and thus
simulation is the analysis tool of choice. However, in order to provide
sufficient accuracy, cycle-accurate simulators of high-performance
(superscalar, out-of-order processors) become very inefficient and, thus, not
useful for efficient design exploration. Techniques for speeding-up
simulation at almost no loss in accuracy have been proposed to handle this
challenge by exploiting the locality properties of most software applications.
The same type of behavior can be used to enable microarchitecture power
management. To close the gap between system and gate-level power
management techniques, the energy-aware computing paradigm has to be
supported at the microarchitectural level by the application adaptive, fine-
grain management of hardware resources in order to achieve the best energy
efficiency under given performance constraints.

REFERENCES

[1] S. Edwards, L. Lavagno, E. A. Lee, A. Sangiovanni-Vincentelli, “Design of embedded
systems: formal models, validation, and synthesis," Proc. IEEE, Vol. 85, March 1997.
[2] H. Chang et al., "Surviving the SOC revolution: a guide to platform-based design,"
Kluwer 1999.
[3] M. Martonosi, D. Brooks and P. Bose, “Modeling and analyzing CPU power and
performance: metrics, methods, and abstractions,” Tutorial at SIGMETRICS, 2001.
[4] S. Malik, M. Martonosi, Y.-T. Li, “Static timing analysis of embedded software,” Proc.
DAC, Anaheim, CA, 1997.
[5] T.-Y. Yen, W. Wolfe, “Performance estimation for real-time distributed embedded
systems,” Proc. ICCD, 1995.
[6] A. Kalavade, P. Moghe, "A tool for performance estimation of networked embedded
end-systems," Proc. DAC, San Francisco, CA, June 1998.
[7] Y. Shin, D. Kim, K. Choi, “Schedulability-driven performance analysis of multiple mode
embedded real-time systems,” Proc. DAC, Los Angeles, CA, June 2000.
[8] A. Xie, P. A. Beerel, "Accelerating Markovian analysis of asynchronous systems using state
compression,” in IEEE Trans. on CAD, July 1999.
[9] R. Marculescu, A. Nandi, “Probabilistic application modeling for system-level
performance analysis,” Proc. DATE, Munich, Germany, March, 2001.
[10] M. Sgroi, L. Lavagno, A. Sangiovanni-Vincentelli, “Formal models for embedded
systems design,” IEEE Design and Test of Computers, Vol. 17, April-June 2000.
[11] H. Hulgaard, S. M. Burns, T. Amon, G. Borriello, “An algorithm for exact bounds on the
time separation of events in concurrent systems,” IEEE Trans. on Computers, Vol. 44,
no. 11, pp. 1306-1317, 1995.
[12] A. Mathur, A. Dasdan, R. Gupta, "Rate analysis for embedded systems," ACM TODAES,
Vol. 3, no. 3, July 1998.
[13] F. Thoen and F. Catthoor, “Modeling, verification and exploration of task-level
concurrency in real-time embedded systems,” Kluwer 2000.
[14] J. Buck, S. Ha, E. A. Lee, D. G. Messerschmitt, "Ptolemy: a framework for simulating
and prototyping heterogeneous systems," pp. 155-182, Apr. 1994.
[15] F. Balarin, et al., “Hardware-software co-design of embedded systems: The POLIS
approach,” Kluwer Academic Publishers, 1997.
[16] J. Buck and R. Vaidyanath, "Heterogeneous modeling and simulation of embedded
systems in El Greco,” in Proc. CODES, March 2000.
[17] http://www.cadence.com/products/vcc.html
[18] P. Lieverse, P. Van der Wolf, E. Deprettere, K. Vissers, “A methodology for architecture
exploration of heterogeneous signal processing systems,” Proc. Workshop in SiPS,
Taipei, Taiwan, 1999.
[19] B. Plateau, K. Atif, “Stochastic automata network for modelling parallel systems,” IEEE
Trans. on Software Engineering, Vol. 17, Oct. 1991.
[20] B. Plateau, J. M. Fourneau, "A methodology for solving Markov models of parallel
systems,” Journal of Parallel and Distributed Camp., Vol. 12, 1991.
[21] W. J. Stewart, "An introduction to the numerical solution of Markov chains," Princeton
Univ. Press, New Jersey, 1994.
[22] T. Sikora, “MPEG digital video-coding standards,” in IEEE Signal Processing
Magazine, Sept. 1997.
[23] D. Harel, “Statecharts: A visual formalism for complex systems,” in Sci. Comp. Prog,
Vol. 8, 1987.
[24] T. Simunic, L. Benini, G. De Micheli, “Cycle-accurate simulation of energy
consumption in embedded systems,” Proc. DAC, New Orleans, June 1999.
[25] T. Givargis, F. Vahid, J. Henkel, "System-level exploration for Pareto-optimal
configurations in parameterized systems-on-a-chip,” in Proc. IEEE/ACM Intl.
Conference on Computer Aided Design (ICCAD), November 2001.
[26] David Brooks, Vivek Tiwari, Margaret Martonosi, “Wattch: a framework for
architectural-level power analysis and optimizations,” Proc. ISCA, June 2000.
[27] S. Palacharla, N.P. Jouppi, and J.E. Smith, “Quantifying the complexity of superscalar
processors,” CS-TR-1996-1328, Univ. of Wisconsin, Nov. 1996.
[28] N. Vijaykrishnan, M. Kandemir, M.J. Irwin, H.S. Kim, and W. Ye, “Energy-driven
integrated hardware-software optimizations using SimplePower," in Proc. Intl.
Symposium on Computer Architecture, Vancouver, BC, Canada, June 2000.
[29] V. Zyuban and P. Kogge, “Optimization of high-performance superscalar architectures
for energy efficiency,” in Proc. ACM Intl. Symposium on Low Power Electronics and
Design, Portofino, Italy, July 2000.
[30] D. Marculescu and A. Iyer, "Application-driven processor design exploration for power-
performance trade-off analysis," in Proc. IEEE/ACM Intl. Conference on Computer
Aided Design, Nov. 2001.
[31] D. H. Albonesi, "Dynamic IPC/clock rate optimization," in Proceedings of the
International Symposium on Computer Architecture (ISCA), 1998.
[32] D. Burger, T.M. Austin, "The SimpleScalar tool set, Version 2.0," CSD Technical Report
#1342, University of Wisconsin-Madison, June 1997.
[33] C. Price, “MIPS IV Instruction Set, revision 3.1.,” MIPS Technologies, Inc., Mountain
View, CA, Jan. 1995.
[34] K.I. Farkas, N.P. Jouppi, and P. Chow, “Register file design considerations in
dynamically scheduled processors,” WRL Research Report 95/10, Digital Equipment
Corp., Nov. 1995.
[35] S.J.E. Wilton and N.P. Jouppi, “An enhanced access and cycle time model for on-chip
caches,” WRL Research Report 93/5, Digital Equipment Corp., July 1994.
[36] C.-Y. Tsui, K.-K. Chan, Q. Wu, C.-S. Ding, and M. Pedram, “A power estimation
framework for designing low power portable video applications,” in Proc. ACM/IEEE
Design Automation Conference, San Diego, June 1997.
[37] V. Tiwari, D. Singh, S. Rajgopal, G. Mehta, R. Patel, F. Baez, "Reducing power in high-
performance microprocessors," in Proc. ACM/IEEE Design Automation Conference,
pp.732-737, June 1998.
[38] C. Svensson and D. Liu, “Low power circuit techniques,” in Low Power Design
Methodologies (Eds. J.M. Rabaey and M. Pedram), pp.37-64, Kluwer Academic,
Norwell, MA, 1996.
[39] Cacti 2.0 Technical Report,
http://www.research.compaq.com/wrl/people/jouppi/cacti2.pdf
[40] V. Zyuban and P. Kogge, “Optimization of high-performance superscalar architectures
for energy efficiency,” in Proc. ACM Intl. Symposium on Low Power Electronics and
Design, Portofino, Italy, July 2000.
[41] J. Kin et al., “Power efficient media processors: design space exploration,” in Proc.
ACM/IEEE Design Automation Conference, New Orleans, LA, June 1999.
[42] W.-T. Shiue and C. Chakrabarti, “Memory exploration for low power, embedded
systems,” in Proc. ACM/IEEE Design Automation Conference, pp. 140-145, New
Orleans, LA, June 1999.
[43] I. Hong et al., "Power optimization of variable voltage core-based systems," in Proc.
ACM/IEEE Design Automation Conference, San Francisco, CA, June 1998.
[44] G. Qu et al., "Energy minimization of system pipelines using multiple voltages," in Proc.
IEEE Intl. Symposium on Circuits and Systems, June 1999.
[45] M.C. Merten, A.R. Trick, C.N. George, J.C. Gyllenhaal, and W.-M. Hwu, “A hardware-
driven profiling scheme for identifying program hot spots to support runtime
optimization,” in Proc. Intl. Symposium on Computer Architecture, June 1999.
[46] J. Huang, D.J. Lilja, “Extending value reuse to basic blocks with compiler support,” in
IEEE Trans. on Computers, vol.49, No.4, Apr. 2000.
[47] R. Marculescu, D. Marculescu, and M. Pedram, “Sequence compaction for power
estimation: theory and practice,” in IEEE Trans. on Computer-Aided Design of
Integrated Circuits and Systems, vol. 18, No.7, July 1999.
[48] C.-S. Ding, Q. Wu, C.-T. Hsieh, M. Pedram, “Stratified random sampling for power
estimation,” in IEEE Trans. on Computer-Aided Design of Integrated Circuits and
Systems, vol.17, No.6, June 1998.
[49] P.K. Dubey and R. Nair, "Profile-driven generation of trace samples," in Proc. IEEE Intl.
Conf. on Computer Design: VLSI in Computers and Processors, Oct. 1996.
[50] A.-T. Nguyen, P. Bose, K. Ekanadham, A. Nanda, M. Michael, “Accuracy and speed-up
of parallel trace-driven architectural simulation,” in Proc. IEEE Intl. Symposium on
Parallel Processing, 1997.
[51] V.S. Iyengar, P. Bose, and L. Trevillyan, "Representative traces for processor models
with infinite cache," in Proc. ACM Intl. Symposium on High-Performance Computer
Architecture, Feb. 1996.
[52] C.-T. Hsieh, M. Pedram, “Microprocessor power estimation using profile-driven
program synthesis,” in IEEE Trans. on Computer-Aided Design of Integrated Circuits
and Systems, vol. 17, No. 11, Nov. 1998.
[53] C.-T. Hsieh, L.-S. Chen, and M. Pedram, “Microprocessor power analysis by labeled
simulation,” in Proc. Design Automation and Test in Europe Conference, Munich,
Germany, March 2001.
[54] A. Iyer and D. Marculescu, "Microarchitecture-resource scaling," in Proc. Design
Automation and Test in Europe Conference, Munich, Germany, March 2001.
[55] A. Iyer and D. Marculescu, "Microarchitecture-level power management," to appear in
IEEE Trans. on VLSI Systems, 2002.
[56] R. Joseph and M. Martonosi, “Run-time power estimation in high performance
microprocessors,” in Proc. ACM Intl. Symposium on Low Power Electronics and Design,
Huntington Beach, CA, Aug. 2001.
[57] K. Chen and C. Hu, “Performance and Vdd Scaling in deep submicrometer CMOS,”
IEEE Journal of Solid State Circuits (JSSC), October 1998.
[58] H. Sanchez, B. Kuttanna, T. Olson, M. Alexander, G. Gerosa, R. Philip, and J. Alvarez,
“Thermal management system for high performance PowerPC microprocessors,” in
Proc. IEEE CompCon, 1997.
Chapter 10
Tools and Techniques for Integrated Hardware-
software Energy Optimizations

N. Vijaykrishnan, M. Kandemir, A. Sivasubramaniam, and M. J. Irwin


Microsystems Design Lab. Pennsylvania State University, University Park, PA 16802

Abstract: With the emergence of a plethora of embedded and portable applications,
energy dissipation has joined throughput, area, and accuracy/precision as a
major design constraint. Thus, designers must be concerned with both
estimating and optimizing the energy consumption of circuits, architectures,
and software. Most of the research in energy optimization and/or estimation
has focused on single components of the system and has not looked across the
interacting spectrum of the hardware and software. This chapter describes the
design of energy estimation tools that support both software and architectural
experimentation within a single framework. Furthermore, techniques that
optimize the hardware-software interaction from an energy perspective are
illustrated.

Key words: Simulation tools, energy estimation, kernel energy consumption, compiler
optimizations, architectural optimizations

10.1 INTRODUCTION

Performance optimization has long been the goal of different architectural
and systems software studies, driving technological innovations to the limits
for getting the most out of every cycle. Advancing technology has made it
possible to incorporate millions of transistors on a very small die and to
clock these transistors at very high speeds. While these innovations and
trends have helped provide tremendous performance improvements over the
years, they have at the same time created new problems that demand
immediate consideration. An important and daunting problem is the power
consumption of hardware components and the resulting thermal and
reliability concerns that it raises. As power dissipation increases, the cost of
power delivery to the increasing number of transistors and the thermal
packaging for cooling the components goes up significantly [1][2]. Cooling
systems need to be designed to tackle the peak power consumption of any
component. These factors are making power as important a criterion for
optimization as performance in commercial high-end systems design.
Similarly, energy optimization is of importance for the continued
proliferation of low-end battery-operated mobile systems. Unless
optimizations are applied at different levels, the capabilities of future mobile
systems will be limited by the weight of the battery required for a reasonable
duration of operation.
Just as with performance, power optimization requires careful design at
several abstraction levels [3]. The design of a system starts from the
specification of the system requirements and proceeds through several
design levels spanning across architectural design, logic design, and circuit
design, finally resulting in a physical implementation. Energy savings can be
obtained at all levels of design, ranging from low-level circuit and gate
optimizations to algorithmic selection. In earlier design methodologies,
cycling through logic synthesis and physical design was used as the main
iteration loop to refine and verify a design. This method, however, is not
keeping up with the complexity of today's system-on-chip designs. By the
time the design of today's large and complex designs have been specified to
the circuit or gate level, it may be too late or too expensive to go back and
deal with excess power consumption problems. Also, various architectural
alternatives need to be explored since achieving an optimal design during the
first design iteration is very difficult in complex designs. Thus, system
designers need advanced techniques and related tools for the early estimation
of power dissipation during the design phase in order to quickly explore a
variety of design alternatives at the architectural/RT Level.
The increasing need for low-power systems has motivated the EDA
industry to introduce power-estimation tools for the various design levels. A
number of commercial tools are routinely used to accurately estimate the
power of portions of million transistor architectures represented as transistor
or gate-level netlists. However, relatively few commercial tools exist to
support the RT-level estimation essential for design space exploration. The
most mature among these tools is WattWatcher/Architect from Sente, which
operates on an RTL description. It uses a gate-level power library and
simulation data from an RTL simulation to compute a power estimate for the
design. Accuracy is traded for improvements in run time. Only prototype
research tools/methodologies exist to support the behavioral or architectural
design level. The successful design and evaluation of architectural and
software optimization techniques are invariably tied to a broad and accurate
set of rich tools that are available for conducting these studies. While tools
for analyzing optimizations at the circuit and logic level are fairly mature,
architectural and software power estimation and optimizations tools are in
their infancy.
This chapter will look at the issues in designing such simulators and the
power modeling issues faced. Next, the details of two different architectural
simulators targeted at superscalar and VLIW architectures will be explored.
Finally, their application will be illustrated with some case studies of how to
evaluate software and hardware optimizations using these tools.

10.2 POWER MODELING

The level of detail in the modeling performed by the simulator influences
both the accuracy of estimation as well as the speed of the simulator. Most
research in architectural-level power estimation is based on empirical
methods that measure the power consumption of existing implementations
and produce models from those measurements. This is in contrast to
approaches that rely on information-theoretic measures of activity to
estimate power [4] [5]. Measurement-based approaches for estimating the
power consumption of datapath functional units can be divided into three
sub-categories. The first technique, introduced by Powell and Chau [6], is a
fixed-activity macro-modeling strategy called the Power Factor
Approximation (PFA) method. The energy models are parameterized in
terms of complexity parameters and a PFA proportionality constant. Similar
schemes were also proposed by Kumar et al. in [7] and Liu and Svensson in
[8]. This approach assumes that the inputs do not affect the switching
activity of a hardware block. To remedy this problem, activity-sensitive
empirical energy models were developed. These schemes are based on
predictable input signal statistics; an example is the method proposed by
Landman and Rabaey [9]. Although the individual models built in this way
are relatively accurate (a 10% - 15% error rate), overall accuracy may be
sacrificed due to incorrect input statistics or the inability to model the
interactions correctly. The third empirical technique, transition-sensitive
energy models, is based on input transitions rather than input statistics. The
method, proposed by Mehta, Irwin, and Owens [10], assumes an energy
model is provided for each functional unit - a table containing the power
consumed for each input transition. Closely related input transitions and
energy patterns can be collapsed into clusters, thereby reducing the size of
the table. Other researchers have also proposed similar macro-model based
power estimation approaches [12][11].
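A minimal sketch of a fixed-activity macro-model in the PFA style follows. The proportionality constant, the N² complexity measure for an array multiplier, and the function name are illustrative assumptions, not values from [6].

```python
# Sketch of a fixed-activity macro-model in the style of the PFA method:
# block power = kappa * G * f, with kappa an empirically fitted constant,
# G a complexity parameter, and f the activation frequency.  All numbers
# here are made up for illustration.

def pfa_power(kappa, complexity, activation_hz):
    """Fixed-activity estimate: power is independent of input statistics."""
    return kappa * complexity * activation_hz

N = 16  # hypothetical multiplier word length; complexity taken as N^2
p_mult = pfa_power(kappa=15e-12, complexity=N * N, activation_hz=100e6)

# Note that two very different input streams would get the same estimate --
# exactly the limitation that motivated the activity-sensitive models.
```
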
Recently, various architectural energy simulators have been designed that
employ a combination of these power models. These simulators derive
power estimates from the analysis of circuit activity induced by the
application programs during each cycle and from detailed capacitive models
for the components activated. A key distinction between these different
simulators is in the degree of estimation accuracy and estimation speed. For
example, the SimplePower energy simulator [13] employs transition-
sensitive energy models for the datapath functional unit. SimplePower core
accesses a table containing the switch capacitance for each input transition of
the functional unit exercised. Table 10.1 shows the structure of such a table
for an n-input functional unit.

The use of a transition-sensitive approach has both design challenges and
performance concerns during simulation. The first concern is that the
construction of these tables is time consuming. Unfortunately, the size of
this table grows exponentially with the size of the inputs. The table
construction problem can be addressed using partitioning and clustering
mechanisms. Further, not all tables grow exponentially with the number of
inputs. For example, consider a bit-independent functional unit such as a
pipeline register where the operation for each bit slice does not depend on
the values of other bit slices. In this case, the only switch capacitance table
needed is a small table for a one-bit slice. The total energy consumed by the
module can be calculated by summing the energy consumed by each bit
transition.
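A toy version of such a lookup for a bit-independent pipeline register might look as follows; the capacitance values, supply voltage, and function name are hypothetical placeholders, not data from an actual characterization.

```python
# Minimal transition-sensitive energy lookup in the spirit of Table 10.1.
# For a bit-independent unit, a single 1-bit switch-capacitance table is
# applied per bit slice and the per-slice energies are summed.

VDD = 1.8  # volts (assumed)

# One-bit table (pF): only toggling bit slices switch capacitance.
BIT_TABLE = {(0, 0): 0.00, (0, 1): 0.03, (1, 0): 0.03, (1, 1): 0.00}

def register_energy_pj(prev_word, curr_word, width=32):
    """Sum per-slice energies: E = 0.5 * C_switched * Vdd^2 (pJ for pF)."""
    c_pf = 0.0
    for i in range(width):
        prev_bit = (prev_word >> i) & 1
        curr_bit = (curr_word >> i) & 1
        c_pf += BIT_TABLE[(prev_bit, curr_bit)]
    return 0.5 * c_pf * VDD ** 2

# Two bit slices toggle between these words, so 0.06 pF switches:
e = register_energy_pj(0b1010, 0b1001)
```

The full-table variant for an n-input unit simply replaces the per-bit key with the whole (previous, current) input pair, which is why table size grows exponentially with input width.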
A second concern with employing transition-sensitive models is the
performance cost of a table lookup for each component accessed in a cycle.
In order to overcome this cost, simulators such as SoftWatt [14] and Wattch
[15] use a simple fixed-activity model for the functional units. These
simulators only track the number of accesses to a specific component and
utilize an average capacitive value to estimate the energy consumed.
Even the same simulator can employ different types of power models for
different components. For example, SimplePower estimates the energy
consumed in the memories using analytical models [16]. In contrast to the
datapath components that utilize a transition-sensitive approach, these
models estimate the energy consumed per access and do not accommodate
the energy differences found in sequences of accesses. Since energy
consumption is impacted by switching activity, two sequential memory
accesses may exhibit different address decoder energy consumptions.
However, for memories, the energy consumed by the memory core and sense
amplifiers dominates these transition-related differences. Thus, simple
analytical energy models for memories have proven to be quite reliable.
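A per-access analytical memory model of this kind reduces to multiplying an access count by a characterized per-access energy; the component energies and names in this sketch are placeholder assumptions, not figures from [16].

```python
# Sketch of a per-access analytical memory energy model: every access is
# charged the same characterized energy, deliberately ignoring the small
# transition-dependent decoder differences, which are dominated by the
# memory core and sense amplifiers.  Numbers are placeholders.

E_DECODE_PJ = 1.5   # assumed average decoder energy per access (pJ)
E_CORE_PJ   = 18.0  # assumed core + sense-amplifier energy per access (pJ)

def memory_energy_pj(num_accesses):
    return num_accesses * (E_DECODE_PJ + E_CORE_PJ)

e = memory_energy_pj(1_000_000)  # energy for one million accesses, in pJ
```
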
Another approach to evaluating energy estimates at the architectural level
exploits the correlation between performance and energy metrics. These
techniques [17][18] use performance counters present in many current
processor architectures to provide runtime energy estimates.
Most of the current architectural energy-estimation tools focus mainly on
the dynamic power consumption and do not account for leakage energy
accurately. Leakage modeling is especially important in future architectures
since the leakage current per transistor is increasing in conjunction with the
increasing number of transistors on a chip. Leakage energy can be modeled
based on empirical data similar to dynamic energy. As leakage currents in
functional units are dependent on the inputs, it is possible to either employ a
more accurate table lookup mechanism or an average leakage current value
that can enable a faster estimation speed. Memory elements can be modeled
analytically using the size of the memory and the characterization of an
individual cell. However, leakage energy modeling at a higher abstraction
level in an architectural simulator is a challenging task and requires more
effort. New abstractions to capture the influence of various factors such as
stacking, temperature, and circuit style as well as new leakage control
mechanisms are in their infancy.
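A first-order leakage estimate along the lines suggested above scales the characterization of a single cell by array size; the per-cell leakage current and supply voltage below are placeholder assumptions, and input, stacking, and temperature effects are deliberately ignored.

```python
# First-order leakage-energy sketch: scale one characterized cell by the
# number of cells in the array.  Both constants are assumed placeholder
# values, not measured characterization data.

VDD = 1.0                  # volts (assumed)
I_LEAK_PER_CELL = 20e-12   # amps per cell (assumed characterization)

def memory_leakage_energy_j(num_cells, seconds, vdd=VDD):
    """E_leak = N * I_leak * Vdd * t."""
    return num_cells * I_LEAK_PER_CELL * vdd * seconds

# A 32 KB array (8 cells per byte) leaking for one second:
e_leak = memory_leakage_energy_j(32 * 1024 * 8, 1.0)
print(f"leakage energy: {e_leak * 1e6:.2f} uJ")   # ~5.24 uJ
```
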

10.3 DESIGN OF SIMULATORS

Architectural-level energy simulators can exploit the infrastructure
developed for performance evaluation tools. As examples, the design of two
energy simulators built on top of widely used architectural tools for
performance evaluation is explained further.
In this section, the design of SoftWatt is elaborated on to show how it was
built on top of the SimOS toolset. SoftWatt is designed to provide detailed
performance and power profiles for different hardware and software
components over the course of execution of real applications running on a
commercial operating system. This tool is unique in its ability to track
energy consumption during both kernel and user-execution modes.
Next, the design of VESIM, an energy simulator for a VLIW processor
built on top of the Trimaran infrastructure, is described.
10.3.1 A SimOS-Based Energy Simulator

SimOS, which provides a very detailed simulation of the hardware sufficient
to run the IRIX 5.3 operating system, serves as the base simulator for
SoftWatt. SimOS also provides interfaces for event monitoring and statistics
collection. The simulator has three CPU models, namely, Embra, Mipsy, and
MXS. Embra employs dynamic binary translation and provides a rough
characterization of the workload. Mipsy emulates a MIPS R4000-like
architecture. It consists of a simple pipeline with blocking caches. MXS
emulates a MIPS R10000-like [19] superscalar architecture. The overall
design of the energy simulator is given in Figure 10.1.

The MXS CPU and the memory subsystem simulators are modified to
trace accesses to their different components. This enables the simulations to
be analyzed using the Timing Trees [20] mechanism provided by SimOS.
MXS is used to obtain detailed information about the processor. However,
the MXS CPU simulator does not report detailed statistics about the memory
subsystem behavior. Due to this limitation in SimOS, Mipsy is used to obtain
this information.
Since disk systems can be a significant part of the power budget in
workstations and laptops, a disk power model is also incorporated into
SimOS to study the overall system power consumption. SimOS models a
HP97560 disk. This disk is not state-of-the-art and does not support any low-
power modes. Therefore, a layer is incorporated on top of the existing disk
model to simulate the TOSHIBA MK3003MAN [21] disk, a more
representative modern disk that supports a variety of low-power modes. The
state machine of the operating modes implemented for this disk is shown in
Figure 10.2. The disk transitions from the IDLE state to the ACTIVE state
on a seek operation. The time taken for the seek operation is reported by the
disk simulator of SimOS. This timing information is used to calculate the
energy consumed when transitioning from the IDLE to the ACTIVE state. In
the IDLE state, the disk keeps spinning. A transition from the IDLE state to
the STANDBY state involves spinning the disk down. This operation incurs
a performance penalty. In order to service an I/O request when the disk is in
the STANDBY state, the disk has to be spun back up to the ACTIVE state.
This operation incurs both a performance and energy penalty. The SLEEP
state is the lowest power state for this disk. The disk transitions to this state
via an explicit command from the operating system.
It is assumed that the spin up and spin down operations take the same
amount of time and that the spin down operation does not consume any
power. This model also assumes that the transition from the ACTIVE to the
IDLE state takes zero time and power as in [22]. Currently, the SLEEP state
is not utilized. The timing modules of SimOS are suitably modified to
accurately capture mode transitions. While it is clear that modeling a disk is
important from the energy perspective, the features of a low-power disk can
also influence the operating system routines such as the idle process running
on the processor core. Hence, a disk model helps to characterize the
processor power more accurately. During the I/O operations, energy is
consumed in the disk. Furthermore, as the process requesting the I/O is
blocked, the operating system schedules the idle process to execute.
Therefore, energy is also consumed in both the processor and the memory
subsystem.
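A toy re-implementation of this layered disk model, with the stated simplifying assumptions baked in, can illustrate the state machine; the per-state power numbers and class name are placeholders, not MK3003MAN datasheet values.

```python
# Toy layered disk power model: IDLE/ACTIVE/STANDBY/SLEEP states, with
# the chapter's assumptions encoded (spin-up time equals spin-down time,
# spin-down and the ACTIVE->IDLE transition are free).  Power numbers
# are illustrative placeholders.

POWER_W = {"ACTIVE": 2.3, "IDLE": 1.0, "STANDBY": 0.3, "SLEEP": 0.1}

class DiskModel:
    def __init__(self, spin_time_s=1.0):
        self.state = "IDLE"
        self.energy_j = 0.0
        self.spin_time_s = spin_time_s  # spin-up time == spin-down time

    def dwell(self, seconds):
        """Accumulate energy while sitting in the current state."""
        self.energy_j += POWER_W[self.state] * seconds

    def seek(self, seek_time_s):
        """Service an I/O request; spin up first if in STANDBY."""
        if self.state == "STANDBY":
            # Spin-up penalty in both time and energy.
            self.energy_j += POWER_W["ACTIVE"] * self.spin_time_s
        self.energy_j += POWER_W["ACTIVE"] * seek_time_s
        self.state = "IDLE"  # ACTIVE -> IDLE assumed free and instantaneous

    def spin_down(self):
        self.state = "STANDBY"  # spin-down assumed to consume no power

disk = DiskModel()
disk.seek(0.01)    # IDLE -> ACTIVE for a 10 ms seek, then back to IDLE
disk.dwell(2.0)    # 2 s idling, platters still spinning
disk.spin_down()   # IDLE -> STANDBY (free, by assumption)
disk.dwell(2.0)    # 2 s in low-power STANDBY
disk.seek(0.01)    # must spin back up first: extra time and energy
```
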
SoftWatt uses analytical power models. A post-processing approach is
taken to calculate the power values. The simulation data is read from the log
files, pre-processed, and input to the power models. This approach results in
the loss of per-cycle information as data is sampled and dumped to the
simulation log file at a coarse granularity, However, there is no slowdown in
the simulation time beyond that incurred by SimOS itself. This is particularly
critical due to the time-consuming nature of MXS simulations. The only
exception to this rule is the disk energy model, where energy consumption is
measured during simulation to accurately account for the mode transitions.
This measurement incurs very little simulation overhead. SoftWatt models a
simple conditional clocking model. It assumes that full power is consumed if
any of the ports of a unit is accessed; otherwise no power is consumed.
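The conditional-clocking rule described above fits in one function; the power and cycle-time numbers below are illustrative, not SoftWatt's actual parameters.

```python
# Simple conditional-clocking rule: a unit dissipates its full per-cycle
# power if any of its ports is accessed in a cycle, and none otherwise.
# The 0.5 W unit power and 2 ns cycle are assumed example values.

def unit_cycle_energy_j(full_power_w, cycle_s, ports_accessed):
    return full_power_w * cycle_s if any(ports_accessed) else 0.0

e_busy = unit_cycle_energy_j(0.5, 2e-9, [True, False])   # one port used
e_idle = unit_cycle_energy_j(0.5, 2e-9, [False, False])  # unit not clocked
```
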
The per-access costs of the cache structures are calculated based on the
model presented in [16] [15]. The clock generation and distribution network
is modeled using the technique proposed in [23], which has an error margin
of 10%. The associative structures of the processor are modeled as in
[15][24].
An important and difficult task in the design of architectural energy
simulators is the validation of their estimates. Due to the flexibility provided
by the architectural tools in evaluating different configurations, even
choosing a configuration to validate is challenging. A common approach
used in several works is to validate estimates of configurations similar to
commercial processors for which published data sheets are available [15]. As
an example, in order to validate the entire CPU model, SoftWatt is
configured to calculate the maximum CPU power of the R10000 processor.
In comparison to the maximum power dissipation of 30W reported in the
R10000 data sheet [25], SoftWatt reports 25.3W. As detailed circuit-level
information is not available at this level, generalizations made in the
analytical power models that do not, for example, capture the system-level
interconnect capacitances result in an estimation error.

10.3.2 Trimaran-based VLIW Energy Simulator

Very Long Instruction Word (VLIW) architectures are becoming popular
and being adopted in many DSP and embedded architectures [26]. These
architectures are inherently more energy efficient than superscalar
architectures due to their simplicity. Instead of relying on complex hardware
such as dynamic dispatchers, VLIW architectures depend on powerful
compilation technology. Various compiler optimizations have been designed
to improve the performance of these VLIW architectures [27][28]. However,
not much effort has been expended to optimize the energy consumption of
such architectures. In this section, the design and use of a VLIW energy
simulation framework, VESIM, is presented to enable more research on
optimizing energy consumption in VLIW architectures. The VLIW energy
estimation framework proposed here provides flexibility in studying both
software and hardware optimizations. As VESIM is built on top of Trimaran
compilation and simulation framework, it has access to various high-level
and low-level compiler optimizations and can easily permit implementation
of new compiler optimizations.
Trimaran is a compiler infrastructure used to provide a vehicle for
implementation and experimentation for state-of-the-art research in compiler
techniques for Instruction Level Parallelism (ILP) [29]. As seen in Figure
10.3, a program written in C flows through IMPACT, Elcor, and the cycle-
level simulator. IMPACT applies machine-independent classical
optimizations and transformations to the source program, whereas Elcor is
responsible for machine-dependent optimizations and scheduling. The cycle-
level simulator generates run-time information for profile-driven
compilations. The cycle-level simulator was modified to trace the access
patterns of different components of the architecture. This profile information
was used along with technology-dependent energy parameters to obtain the
energy consumption of the architecture. The VLIW energy estimation
framework presented here is activity-based in that energy consumption is
based on the number of accesses to the components. The Trimaran
framework is also enhanced in order to model the cache by incorporating the
DineroIII cache simulator. The major components modeled in this energy-
estimation framework include the instruction cache, register files,
interconnect structure between register files and the functional units, the
functional units, data cache, and clock circuitry. This tool has also been
augmented to provide leakage-energy estimation. However, it only
approximates the leakage energy per cycle for a component in terms of the
component's per access dynamic energy. Such an abstraction provides an
ability to study optimizations for future technologies for which actual
characterization information is not available.
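The activity-based accounting and the leakage abstraction described above can be sketched as follows. Component names, per-access energies, and the leakage ratio below are illustrative assumptions, not VESIM's actual characterized parameters.

```python
# Activity-based energy estimation: total dynamic energy is each component's
# access count times its technology-dependent per-access energy. Leakage is
# abstracted as a fixed fraction of a component's per-access dynamic energy,
# charged every cycle (all numeric values here are illustrative).

PER_ACCESS_NJ = {"icache": 0.5, "regfile": 0.2, "alu": 0.1, "dcache": 0.6}  # assumed
LEAK_RATIO = 0.1  # leakage energy/cycle as a fraction of per-access dynamic energy

def estimate_energy(access_counts, total_cycles):
    # dynamic: accesses x per-access energy, summed over components
    dynamic = sum(PER_ACCESS_NJ[c] * n for c, n in access_counts.items())
    # leakage abstraction: every component leaks on every cycle
    leakage = sum(LEAK_RATIO * e * total_cycles for e in PER_ACCESS_NJ.values())
    return dynamic + leakage  # in nJ

total_nj = estimate_energy(
    {"icache": 1000, "regfile": 2500, "alu": 1200, "dcache": 400}, total_cycles=1500)
```

Expressing leakage relative to dynamic per-access energy is what lets the model be retargeted to a future technology by changing a single ratio, even before actual characterization data exists.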

10.4 HARDWARE-SOFTWARE OPTIMIZATIONS: CASE STUDIES

In this section, the use of architectural-level energy simulators in estimation
and optimization is illustrated using three case studies. First, it is
demonstrated that SoftWatt can be used to track the influence of kernel
routines and power-efficient peripherals on the system's energy profile. Next,
it is shown that VESIM can be used to explore the influence of compiler
optimizations and architectural modifications on system power.

10.4.1 Studying the Impact of Kernel and Peripheral Energy Consumption

Table 10.2 gives the baseline configuration of SoftWatt that was used for the
experiments in this section. The Spec JVM98 benchmarks [30] were chosen
for conducting this characterization study. Java applications are known to
exercise the operating system more than traditional benchmark suites [31].
Thus, they form an interesting suite to characterize for power in a power
simulator like SoftWatt that models the operating system.
Figure 10.4 presents the overall power budget of the system including the
disk. This uses the baseline disk configuration, which gives an upper bound
on the disk's power consumption. It can be observed that, when no power-related
optimizations are performed, the disk is the single-largest consumer of
power in the system.
By including the IDLE state in the disk configuration, the dominance of
the disk in the power budget decreases from 34% to 23% as shown in Figure
10.5. This optimization provides significant power-savings and also alters
the overall picture. Now the L1 I-cache and the clock dominate the power
profile.

SoftWatt also provides the ability to analyze the energy behavior of
kernel routines in more detail and show that 15% of the energy is consumed
in executing the kernel routines for the target applications. Among the kernel
services, the utlb and read services are the major contributors to system
energy. MIPS architectures have a software-managed TLB. The operating
system handles the misses by performing the required address translation,
reloads the TLB, and then restarts the user process. These operations are
done by the utlb service. However, the frequently used utlb routine has
smaller power consumption as compared to read since it exercises fewer and
less energy-consuming components.
The characterization of the kernel routines presented here also provides
insight for accelerating the energy-estimation process. It was observed that
the per-invocation energy of the kernel services is fairly constant across different
applications. Thus, it is possible to estimate the energy consumed by kernel
code with an error margin of about 10% without detailed energy simulation.
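Because the per-invocation energy of each kernel service is roughly constant, kernel energy can be estimated from invocation counts alone. A minimal sketch, with hypothetical per-invocation energies standing in for characterization results:

```python
# Fast kernel-energy estimate: once the (roughly constant) per-invocation
# energy of each kernel service has been characterized, total kernel energy
# follows from invocation counts alone, with no detailed energy simulation.
# Service names and per-invocation energies below are hypothetical.

PER_INVOCATION_UJ = {"utlb": 0.4, "read": 3.0, "write": 2.5}

def kernel_energy(invocations):
    # total kernel energy = sum over services of (count x per-invocation energy)
    return sum(PER_INVOCATION_UJ[s] * n for s, n in invocations.items())

e_kernel_uj = kernel_energy({"utlb": 5000, "read": 200, "write": 50})
```

Note how the (assumed) numbers mirror the observation in the text: utlb is invoked far more often but costs much less per invocation than read.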

In addition, the results reveal the potential for power optimizations when
executing the kernel idle process. Whenever the operating system does not
have any process to run, it schedules the idle process. Though this has no
performance implications, over 5% of the system energy is consumed during
this period. This energy consumption can be reduced by transitioning the
CPU and the memory subsystem to a low-power mode or by even halting the
processor, instead of executing the idle process.

10.4.2 Studying the Impact of Compiler Optimizations

VLIW processors suffer from insufficient parallelism to fill the functional
units available. Block-formation algorithms such as superblock and
hyperblock are often used [27][28] to help overcome this problem. It has
been shown that significant performance improvement can be obtained by
using these algorithms. Here the goal is to evaluate the three block formation
algorithms, basic block (BB), superblock (SB), and hyperblock (HB), from
the energy perspective. First, a brief overview of the two block formation
compiler optimizations is provided.

10.4.2.1 Superblock

Frequently executed paths through the code are selected and optimized at the
expense of the less frequently executed paths [27]. Instead of inserting
bookkeeping instructions where two traces join, part of the trace is
duplicated to optimize the original copy. This scheduling scheme provides
an easier way to find parallelism beyond the basic block boundaries. This is
especially true for control intensive benchmarks because the parallelism
within a basic block is very limited.

10.4.2.2 Hyperblock

The idea is to group many basic blocks from different control flow paths into
a single manageable block for compiler optimization and scheduling using
if-conversion [28].
Two benchmarks from Spec95Int (129.compress and 130.li), two
benchmarks from Mediabench (mpeg2dec, adpcmdec), and the dspstone
benchmarks were selected. The results from the dspstone benchmarks were
averaged. 128 GPR and 4 integer ALUs were assumed. Other parameters
such as instruction latencies are from Trimaran’s standard mdes file. Modulo
scheduling is on.
Figure 10.6 shows the number of cycles and energy used for each
benchmark. All values are scaled to those of BB and no modulo scheduling
case. For 129.compress, dspstone, and mpeg2dec, both energy and
performance show a similar trend. For adpcmdec and 130.li, there is an
anomaly in the SB case. On close examination, it is observed that there is a
50% increase in the number of instructions executed after SB formation as
compared to BB. Note that this does not translate to an increase in the
number of cycles as the average ILP is increased. However, the increased
number of instructions executed manifests itself in the form of increased
energy. Another trend that was observed was that the SB and HB techniques
were not that successful for the dspstone benchmarks. These benchmarks are
quite small and regular. Hence, they do not gain from the more powerful
block-formation techniques.
Figure 10.7 shows the component wise breakdown of the energy graph in
Figure 10.6. It is observed that the data cache and register file energy costs
increase with SB and HB due to an increased number of instructions
executed. But the energy of the instruction cache clock energy decreases
because of a reduction in the number of clock cycles due to increased ILP.

10.4.3 Studying the Impact of Architecture Optimizations

In this section, an example of how energy-estimation tools can be used as an
aid in embedded system design is shown.
As more and more functional units are put into processors, excessive
pressure is put on the register file, since the number of registers and ports in
the register file needs to be large enough to sustain the large number of
functional units. This creates a performance bottleneck. One solution is to
partition the register file into multiple register banks.
Consider a register file organization that trades space for improved
energy consumption behavior. Instead of a single monolithic register file for
all functional units, the functional units are partitioned into two parts with
their own local register file forming two clusters. Additionally, both of the
clusters have access to a common register file as shown in Figure 10.8. The
common register file is used to store variables that are accessed by
functional units in both the clusters. In contrast, the local register files are
accessible only to the functional units in the cluster. While the number of
registers in the resulting architecture is three times more than that of the
single monolithic register file, the clustered architecture reduces the
complexity of the local register files. The number of ports in the local
register files is reduced by half compared to that of the single monolithic
register file since local register files are accessed by only half the number of
functional units. The common register file has the same number of ports as
the monolithic register file. As the energy consumption of the register file is
a function of both the number of ports and the number of registers, the
energy cost per access to the local register file is less than that of the
common register file and the original monolithic register file. When most of
the accesses are confined to the local register file, one can anticipate
improvements. The register allocation procedure in the compiler must be
modified to exploit the local register file organization.
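A first-order sketch of this trade-off follows. The assumption that per-access energy scales with (ports x registers) is an illustrative simplification, not the chapter's characterized model; the port counts (8 read + 4 write = 12 for the monolithic file, halved to 6 for a local file) match the port reductions quoted in the text.

```python
# First-order sketch: per-access register-file energy is assumed to scale
# with (number of ports) * (number of registers). The clustered organization
# halves the ports of the local files, so local accesses become cheaper; the
# common file keeps the full port count.

def access_energy(ports, registers, k=1.0):
    return k * ports * registers  # arbitrary energy units

def monolithic_energy(accesses, ports=12, regs=32):
    return accesses * access_energy(ports, regs)

def clustered_energy(accesses, local_frac, ports=12, regs=32):
    # local_frac of accesses hit the cheaper half-ported local files
    local = local_frac * accesses * access_energy(ports // 2, regs)
    common = (1 - local_frac) * accesses * access_energy(ports, regs)
    return local + common

mono = monolithic_energy(10_000)
clus = clustered_energy(10_000, local_frac=0.9)  # 90% of accesses stay local
```

Under these assumptions the clustered organization uses about 55% of the monolithic energy, illustrating why confining most accesses to the local files is the key to the savings.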
Figure 10.9 shows the relative energy consumption of the register file
architecture compared to the monolithic register file. Hyperblock and
modulo scheduling was on and Trimaran's std parameter set is used. All
benchmarks show reduced energy consumption compared to using one
monolithic register file. In particular, when the GPR size is 32, less than half
the energy of the monolithic register file is consumed. It should be noted,
though, that energy is being traded for area because register files were
duplicated. The energy saving comes from the reduced number of read ports
(reduced by 4) and write ports (reduced by 2), and fewer interconnections
to/from the local register files.

10.5 SUMMARY

In this chapter, an overview of energy models used in different architectural
energy simulators was presented. Next, an insight into the design of such
simulators was provided using SoftWatt and VESIM as examples.
Experiments using SoftWatt were used to illustrate the importance of a
holistic perspective in the evaluation of new architectures. From a software
perspective, it is not sufficient to account only for the user-code energy
estimate, since operating system routines can consume a significant portion
of the total energy; ignoring them could cause significant overestimation of
battery life for the executing application. From a hardware perspective, the experiments
indicate the importance of accounting for peripheral devices such as the disk
in estimating overall energy budget. As optimizations on one component can
have negative ramifications on other components, simulation tools should
provide an estimate for the entire system in order to evaluate the real impact
of such optimizations.
Finally, VESIM, an energy-estimation framework built on top of the
Trimaran tool set for a VLIW architecture, was presented. This framework
was used to show the impact of architectural and compiler optimizations on
energy efficiency. As power consumption continues to be the major limiter
to more powerful and faster designs, there is a need for further explorations
of such software- and architectural-level optimizations.

ACKNOWLEDGMENT

The authors wish to acknowledge the contributions of the students from the
Microsystems Design Group at Penn State who have worked on several
projects reported in this chapter. We would like to specially acknowledge the
contributions of Wu Ye, Hyun Suk Kim, Sudhanva Gurumurthi and Soontae
Kim.

REFERENCES
[1] D. Brooks and M. Martonosi, “Dynamic thermal management for high-performance
microprocessors,” In Proceedings of the Seventh International Symposium on High
Performance Computer Architecture, January 2001.
[2] V. Tiwari, D. Singh, S. Rajgopal, G. Mehta, R. Patel, and F. Baez, “Reducing Power in
High-Performance Microprocessors,” In Proceedings of the Design Automation
Conference, June 1998.
[3] M. Irwin, M. Kandemir, N. Vijaykrishnan, and A. Sivasubramaniam, “A Holistic
approach to system level energy optimization,” In Proceedings of the International
Workshop on Power and Timing Modeling, Optimization, and Simulation, September
2000.
[4] D. Marculescu, R. Marculescu, and M. Pedram, “Information theoretic measures of
energy consumption at register transfer level,” In Proceedings of 1995 International
Symposium on Low Power Design, pp. 81, April 1995.
[5] J. M. Rabaey and M. Pedram, “Low power design methodologies,” Kluwer Academic
Publishers, Inc., 1996.

[6] S. Powell and P. Chau, “Estimating power dissipation of VLSI signal processing chips:
the PFA technique,” In VLSI Signal Processing, IV , pp. 250, 1990.
[7] N. Kumar, S. Katkoori, L. Rader, and R. Vemuri, “Profile-driven behavioral synthesis
for low power VLSI systems,” IEEE Design and Test Magazine, pp. 70, Fall 1995.
[8] D. Liu and C. Svensson, “Power consumption estimation in CMOS VLSI chips,” IEEE
Journal of Solid State Circuits, pp. 663, June 1994.
[9] P. Landman and J. Rabaey, “Activity-sensitive architectural power analysis,” IEEE
Transaction on CAD, TCAD-15(6), pp. 571, June 1996.
[10] H. Mehta, R. M . Owens, and M. J. Irwin, “Energy characterization based on clustering,”
In Proceedings of the 33rd Design Automation Conference, pp. 702, June 1996.
[11] Q. Wu, Q. Qiu, M. Pedram, and C-S. Ding, “Cycle-accurate macro-models for rt-level
power analysis,” IEEE Transactions on VLSI Systems, 6(4), pp. 520, December 1998.
[12] L. Benini, A. Bogoliolo, M. Favalli, and G. De Micheli, “Regression models for
behavioral power estimates,” In Proceedings of International Workshop on Power,
Timing Modeling, Optimization and Simulation, pp. 179, September 1996.
[13] W. Ye, N. Vijaykrishnan, M. Kandemir, and M. Irwin, “The design and use of
simplepower: a cycle-accurate energy estimation tool,” In Proceedings of the Design
Automation Conference, June 2000.
[14] S. Gurumurthi, A. Sivasubramaniam, M. J. Irwin, N. Vijaykrishnan, M. Kandemir, T. Li,
and L. K. John, “Using complete machine simulation for software power estimation: The
SoftWatt Approach,” In Proceedings of the International Symposium on High
Performance Computer Architecture, Feb 2002.
[15] D. Brooks, V. Tiwari, and M. Martonosi. Wattch, “A framework for architectural-level
power analysis and optimizations,” In Proceedings of the 27th International Symposium
on Computer Architecture, June 2000.
[16] M. B. Kamble and K. Ghose, “Analytical energy dissipation models for low power
caches,” In Proceedings of the International Symposium on Low Power Electronic
Design, pp. 143–148, August 1997.
[17] R. Joseph, D. Brooks, and M. Martonosi, “Runtime power measurements as a foundation
for evaluating power/performance tradeoffs,” In Proceedings of the Workshop on
Complexity Effective Design, June 2001.
[18] I. Kadayif, T. Chinoda, M. Kandemir, N. Vijaykrishnan, M. J. Irwin, and A.
Sivasubramaniam, “vEC: virtual energy counters,” In Proceedings of the ACM
SIGPLAN/SIGSOFT Workshop on Program Analysis for Software Tools and
Engineering, June 2001.
[19] K. C. Yeager, “The MIPS R10000 superscalar microprocessor,” IEEE Micro, 16(2), pp.
28-40, April 1996.
[20] S. A. Herrod, “Using complete machine simulation to understand computer system
behavior,” PhD thesis, Stanford University, February 1998.
[21] Toshiba Storage Devices Division, http://www.toshiba.com/.
[22] K. Li, R. Kumpf, P. Horton, and T. E. Anderson, “Quantitative Analysis of Disk Drive
Power Management in Portable Computers,” Technical Report CSD-93-779, University
of California, Berkeley, 1994.
[23] D. Duarte, N. Vijaykrishnan, M. J. Irwin, and M. Kandemir, “Formulation and validation
of an energy dissipation model for the clock generation circuitry and distribution
networks,” In Proceedings of the 2001 VLSI Design Conference, 2001.
[24] S. Palacharla, N. P. Jouppi, and J. E. Smith, “Complexity-effective superscalar
processors,” In Proceedings of the 24th International Symposium on Computer
Architecture, 1997.

[25] R10000 Microprocessor User’s Manual.
http://www.sgi.com/processors/r10k/manual/t5.ver.2.0.book_4.html.
[26] Texas Instruments device information.
http://dspvillage.ti.com/docs/dspproducthome.jhtml.
[27] W. W. Hwu, S. A. Mahlke, W. Y. Chen, P. P. Chang, N. J. Warter, R. A. Bringmann, R.
G. Ouellette, R. E. Hank, T. Kiyohara, G. E. Haab, J. G. Holm, and D. M. Lavery, “The
superblock: an effective technique for VLIW and superscalar compilation,” The Journal
of Supercomputing, pp. 229-248, May 1993.
[28] S. A. Mahlke, D. C. Lin, W. Y. Chen, R. E. Hank, and R. A. Bringmann, “Effective
compiler support for predicated execution using the hyperblock,” In Proceedings of 25th
Annual International Symposium on Microarchitecture, pp. 45-54, 1992.
[29] Trimaran, http://www.trimaran.org.
[30] Spec JVM98 Benchmark Suite, http://www.spec.org/osg/jvm98/.
[31] T. Li, L. K. John, N. Vijaykrishnan, A. Sivasubramaniam, J. Sabarinathan, and A.
Murthy, “Using complete system simulation to characterize specjvm98 benchmarks,” In
Proceedings of the International Conference on Supercomputing, pp. 22-33, May 2000.
[32] Avant! Star-Hspice, http://www.avanticorp.com/products.
[33] F. Douglis and P. Krishnan, “Adaptive disk spin-down policies for mobile computers,”
Computing Systems, 8(4), pp. 381-413, 1995.
[34] P. E. Landman, “High-level power estimation,” In Proceedings of the International
Symposium on Low Power Electronics and Design, pp. 29, August 1996.
[35] M. Lajolo, A. Raghunathan, S. Dey, L. Lavagno, and A. Sangiovanni-Vincentelli,
“Efficient Power Estimation Techniques for HW/SW Systems,” In Proceedings of IEEE
Volta, 1999.
[36] J. R. Lorch, “A complete picture of the energy consumption of a portable computer,”
Master’s thesis, University of California, Berkeley, December 1995.
[37] Y.-H. Lu and G. D. Micheli, “Adaptive hard disk power management on personal
computers,” In Proceedings of the IEEE Great Lakes Symposium, March 1999.
[38] R. P.Dick, G. Lakshminarayana, A. Raghunathan, and N. K. Jha, “Power analysis of
embedded operating systems,” In Proceedings of the 37th Conference on Design
Automation, pp. 312--315, 2000.
[39] M. Rosenblum, S. A. Herrod, E. Witchel, and A. Gupta, “Complete Computer System
Simulation: The SimOS Approach,” IEEE Parallel and Distributed Technology: Systems
and Applications, 3(4), pp. 34-43, 1995.
[40] T. Simunic, L. Benini, and G. D. Micheli, “Cycle-accurate simulation of energy
consumption in embedded Systems,” In Proceedings of the Design Automation
Conference, June 1999.
Chapter 11
Power-aware Communication Systems

Mani Srivastava
University of California, Los Angeles

Abstract: Battery-operated systems usually operate as part of larger networks where they
wirelessly communicate with other systems. Conventional techniques for low-
power design, with their focus on circuits and computation logic in a system,
are at best inadequate for such networked systems. The reason is twofold.
First, the energy cost associated with wireless communications dominates the
energy cost associated with computation. Being dictated primarily by a totally
different set of laws (Shannon and Maxwell), communication energy, unlike
computation energy, does not even benefit much from Moore's Law. Second,
designers are interested in network-wide energy-related metrics, such as
network lifetime, which techniques focused on computation at a single system
cannot address. Therefore, in order to go beyond low-power techniques
developed for stand-alone computing systems, this chapter describes
communication-related sources of power consumption and network-level
power-reduction and energy-management techniques in the context of
wirelessly networked systems such as wireless multimedia and wireless sensor
networks. General principles behind power-aware protocols and resource
management techniques at various layers of networked systems - physical,
link, medium access, routing, transport, and application - are presented. Unlike
their conventional counterparts that only manage bandwidth to achieve
performance, power-aware network protocols also manage the energy
resource. Their goal is not just a reduction in the total power consumption.
Rather, power-aware protocols seek a trade-off between energy and
performance via network-wide power management to provide the right power
at the right place and the right time.

Key words: Power-aware communications, power-aware protocols, radio power
management.

11.1 INTRODUCTION

Power consumption is the primary design metric for battery-operated
embedded computing systems and has led to the development of a variety of
power reduction and management techniques. However, an emerging trend
in embedded systems is that they are being networked to communicate,
usually wirelessly, with other devices and with servers over the Internet.
Examples include cell phones, wireless PDAs, wireless embedded sensors,
etc. Conventional power reduction [1] and management techniques [2], with
their focus on digital circuits and computation logic, are inadequate for these
networked embedded systems. There are two reasons for this.
First, the energy cost of wireless communications often dominates that of
computation. For example, the power consumption of current PDAs such as
Pocket PCs is substantially less than that of the wireless LAN cards that can
be used with these PDAs. Another measure of the dominance of
communication energy is the ratio between the energy consumed for
communicating one bit and the energy consumed for one instruction or
operation. Indeed, even for devices with low data rates and short
communication ranges (and therefore smaller communication-related power
consumption), such as wireless sensors, this ratio is quite high and ranges
from O(100) to O(1000) [3]. To make matters worse, much of the power
required for wireless communications is due to the transmit RF power
needed for successful reception at the desired distance, throughput, and error
rate. For example, an IEEE 802.11b wireless LAN card based on Intersil's
PRISM II chipset would consume approximately 110 mW for the MAC
(medium access control) processor, 170 mW for the digital baseband
electronics, 240 mW for the analog electronics, and 600 mW for the power
amplifier (which generates 63 mW of irradiated RF power). In other words,
roughly 54% of the power consumption is attributed to the RF power. The
relative share of RF power will only become larger as the power
consumption of digital circuits reduces with technology improvements. The
required transmit RF power depends on the signal-to-noise ratio required at
the receiver as per Shannon's Information Theory, and the path loss as per
Maxwell's Laws of Electromagnetism. This is quite different from power
consumption in electronic circuits, which is due primarily to the dynamic
charging and discharging of capacitors and leakage currents. Neither
semiconductor technology nor circuit- and architecture-level techniques such
as static [1] and dynamic [4] voltage scaling, which have helped with
reduction of power in circuits, are of much use in reducing the transmit
power for communications.
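The 54% figure follows directly from the component breakdown quoted above; a quick arithmetic check (values taken from the text, in mW):

```python
# Power breakdown of the 802.11b WLAN card described in the text (mW):
# MAC processor, digital baseband, analog electronics, and power amplifier.
components_mw = {"mac": 110, "baseband": 170, "analog": 240, "power_amp": 600}

total_mw = sum(components_mw.values())            # 1120 mW for the whole card
rf_share = components_mw["power_amp"] / total_mw  # fraction drawn by the PA
```

This confirms that the power amplifier alone accounts for roughly 54% of the card's total power.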
Second, in many networked systems the energy-related metric of interest
is the lifetime of the entire system, as opposed to power consumption at
individual nodes. Such is the case, for example, in networks of wireless
sensors [5] used to monitor physical spaces such as wildlife habitats, smart
classrooms and offices, battlefields, etc. A technique that consumes less
aggregate power but results in power hot-spots where a small number of
nodes see significant energy drain would be less preferred relative to a
technique that consumes higher aggregate power but results in a spatially
even energy drain throughout the network. Conventional approaches such as
dynamic power management (DPM) via shutdown [6] and dynamic voltage
scaling (DVS) [9][10][7][8] are inadequate for networked systems. They
focus on a single computation node as opposed to the network, and the
dynamic voltage scaling that they often make use of is ineffective as an
energy-speed control knob for communication systems.
Clearly, a major hole exists in the research thus far on power reduction
and management techniques. They address well the power consumed by
compute operations in embedded systems, but do not yet address the power
consumed by communication operations. This is a serious concern in devices
that use wireless communication, which is much more power-hungry than
wired communication. For example, contemporary 802.11b wireless LAN
cards consume substantially more power than Ethernet cards at similar data
rates. Moreover, wireless communication is likely to be used with battery-
operated devices, making the power problem all the more important.
This chapter, therefore, restricts itself to the power problem in the context
of wireless communications and seeks to describe recent developments in
wireless communications and networking that have begun to address the gap
in low-power systems research. After first introducing the wireless
communication related sources of power consumption, various recently
developed techniques for reduction and management of power in wirelessly
networked systems, such as wireless multimedia devices and wireless
sensors, are described.

11.2 WHERE DOES THE ENERGY GO IN WIRELESS COMMUNICATIONS

11.2.1 Electronic and RF Energy Consumption in Radios

Prior to delving into the power reduction and management techniques
themselves, it is important to know what the sources of power consumption
in wireless communications are. This is important because these sources are
quite different from those in the case of circuits where the
charging/discharging of capacitances, short circuit currents, and leakage


currents are the sources of power consumption [1].
To understand power consumption in wireless communications, it is
useful to understand what happens inside the radios used for wireless
communications. Figure 11.1 shows the block diagram of a canonical radio
consisting of separate transmission (Tx) and reception (Rx) paths that both
interface to higher-layer protocol processing. In order to transmit
information, bits are coded into channel symbols, which correspond to
different waveforms [11]. The number of possible waveforms determines
how many bits are coded into one symbol, which is given by the modulation
level b, expressed in the number of bits per symbol. The average time to
transmit one bit over the channel is the inverse of the average bit rate and is
given by equation (11.1), where R_s is the symbol rate (the number of
symbols that are transmitted per second):

    T_bit = 1 / (b * R_s)                                        (11.1)

The Tx path consists of link-layer and baseband processing (digital circuits
and possibly software) followed by radio frequency processing (RF analog
circuits) and finally a power amplifier that converts electrical energy into
radio energy for transmission through the antenna. The Rx path consists of
RF analog electronics for radio frequency processing of the signal received
by the antenna, followed by link-layer and baseband processing performed
digitally. The higher-layer protocol processing is performed digitally using a
mix of dedicated hardware and software on programmable processors, and


consists of protocols such as medium access control (e.g., 802.11 MAC
protocol), routing (e.g., ad hoc multi-hop routing via diffusion at nodes in a
wireless sensor network), and transport (e.g., TCP).
The energy consumed by the radio can be viewed to have two broad
components:
- Electronic power P_elec that is consumed by the digital and analog
circuits that perform the necessary RF, baseband, and protocol
processing. Of course, this power is different for data reception and
transmission and depends on factors such as the symbol rate at which
the communication occurs, the modulation scheme used, the condition
of the channel, etc.
- Transmit radio power P_RF that is consumed by the radio power
amplifier in the transmit path.
The energy associated with transmitting or receiving a bit can therefore
be expressed as in equations (11.2) and (11.3) below, where P_elecTx and
P_elecRx represent the electronic power for the transmit and receive cases,
respectively:

    E_tx = (P_elecTx + P_RF) / (b * R_s)                         (11.2)

    E_rx = P_elecRx / (b * R_s)                                  (11.3)

While P_elec is due to power consumed by digital and analog processing
required for wireless communication, Shannon's Information Theory and
Maxwell's Laws dictate P_RF. Therefore, P_RF is not helped by the
technology, circuit, and architecture techniques that help reduce and manage
P_elec. In particular, P_RF is a function of the efficiency of the power
amplifier and the power that needs to be irradiated by the transmitter for
successful reception at the receiver. P_RF may thus be reduced either by
creating higher-efficiency amplifiers or by reducing the radio power that
needs to be irradiated for a given destination, channel, and performance
(data rate and error probability). In addition to being influenced by the
amplifier circuit, the amplifier efficiency is also affected by the modulation
scheme, the choice of which dictates the required amplifier linearity. The
required irradiated power depends on the signal power required by the
receiver (a function of the data rate, bit error rate, noise, interference,
modulation scheme, and receiver architecture) and the path loss suffered by
the signal as it decays with receiver-transmitter separation d in a d^(-alpha)
fashion, where alpha is the path loss exponent (alpha = 2 in free space but
typically larger in real-life channels).

11.2.2 First-order Energy Model for Wireless Communication

A simple energy model for a radio with a specified modulation scheme and
data rate can be obtained by setting E_elecTx and E_elecRx to constant
energy/bit values for the electronics at the Tx and Rx, while treating
E_RF as an energy/bit term at the Tx that is proportional to r^n (n being
the path-loss exponent), where r is the radio range. Clearly, for wireless
communications over a large r, the communication energy will be dominated
by the RF term E_RF, while for a short r the electronic terms (E_elecTx
and E_elecRx) dominate. For example, typical state-of-the-art numbers
reported in the literature for Bluetooth-class radios are 50 nJ/bit for
the electronic terms and a range-dependent value for the RF term [12].
Therefore, for radios designed for short ranges (e.g., personal area
networks) the electronic power consumption dominates the energy spent on
communication, while at larger ranges (e.g., wireless LANs, cellular
systems) the RF power consumption dominates.
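As a rough numeric illustration of this first-order model, the sketch below uses the 50 nJ/bit electronic figure quoted above together with an assumed, purely illustrative RF coefficient, and shows how the RF term overtakes the electronic terms as range grows:

```python
# First-order radio energy model (Section 11.2.2), a sketch with assumed
# illustrative constants: the electronic terms are fixed energy/bit, while
# the RF term grows with range r as r**N.

E_ELEC_TX = 50e-9   # J/bit, electronic energy at the transmitter (from text)
E_ELEC_RX = 50e-9   # J/bit, electronic energy at the receiver (assumed equal)
K_RF = 100e-12      # J/bit/m^2, assumed RF amplifier coefficient
N = 2               # path-loss exponent (free space)

def energy_per_bit(r):
    """Total Tx+Rx energy to move one bit over range r (meters)."""
    e_rf = K_RF * r**N
    return E_ELEC_TX + E_ELEC_RX + e_rf

# The electronic terms dominate at short range, the RF term at long range.
for r in (1, 10, 100, 1000):
    total = energy_per_bit(r)
    rf_fraction = (K_RF * r**N) / total
    print(f"r={r:5d} m  energy={total:.2e} J/bit  RF share={rf_fraction:.0%}")
```

With these assumed constants the crossover sits at a few tens of meters, matching the personal-area-network vs. wireless-LAN split described in the text.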
Besides transmit and receive states, radios can be in two other states:
sleep and idle. In the sleep state, the radio is essentially off and consumes
little or no power. In the idle state, the radio is listening for data packet
arrival but is not actively receiving or transmitting data. Traditionally,
the idle-state power consumption of the radio has been assumed to be
insignificant, and the energy spent on communication is counted as the
energy spent on the data packets actually received or transmitted. In
reality, the idle-state power consumption is almost the same as that in the
receive mode, and ignoring it can lead to fallacious conclusions about the
relative merits of the various protocols and power management strategies
[13]. For example, the TR1000 radio transceiver from RF Monolithics is
commonly used in wireless sensor networks. This low-power radio has a data
rate of 2.4 kbps and uses On-Off Keying (OOK) modulation. For a range of
20 m, its
power consumption is 14.88 mW, 12.50 mW, 12.36 mW, and 0.016 mW in
the transmit, receive, idle, and sleep states, respectively. Clearly, idle
listening is not cheap!
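Using the TR1000 numbers above, a small calculation makes the point concrete: even a radio that is active only 5% of the time (an assumed workload) spends almost all of its energy idle-listening unless it is put to sleep instead:

```python
# Energy budget for the TR1000 numbers quoted above: a lightly loaded radio
# spends most of its energy idle-listening unless it sleeps when idle.

P = {"tx": 14.88e-3, "rx": 12.50e-3, "idle": 12.36e-3, "sleep": 0.016e-3}  # W

def energy(seconds_in_state):
    """Joules consumed given a dict of seconds spent in each radio state."""
    return sum(P[s] * t for s, t in seconds_in_state.items())

# One second of operation, with the radio active only 5% of the time
# (2.5% transmitting, 2.5% receiving; the split is an assumption).
always_listening = energy({"tx": 0.025, "rx": 0.025, "idle": 0.95})
duty_cycled      = energy({"tx": 0.025, "rx": 0.025, "sleep": 0.95})

print(f"always listening: {always_listening*1e3:.2f} mJ")
print(f"duty cycled:      {duty_cycled*1e3:.2f} mJ")
```

The always-listening budget comes out roughly 18x larger than the duty-cycled one, which is exactly why counting only packet energy is misleading.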

11.2.3 Power Consumption in Short-range Radios

Electronic-power-dominant short-range wireless communication, which is
important for many emerging embedded systems, is a very different realm
from the usual RF-power-dominant long-range wireless communication. For
short-range radios at GHz carrier frequencies, the RF electronic power
(synthesizers, mixers, etc.) can easily dominate the RF transmit power at
the transmitter. For example, [14] reports an electronic power of 10-100 mW
for
Bluetooth-class radios at a 1 Mbps data rate and a 10^-5 bit error rate (BER).
RF analog circuits, whose power consumption does not vary much with data
rate, in turn dominate the electronic power in these radios. An implication of
this is that using more energy-efficient but lower data rate modulation
schemes, a strategy that is effective for long-range communications, does not
help with short-range communications. Rather, using a high data rate but
energy-inefficient modulation and then shutting the radio down might be
more effective. A hindrance in the case of small data packet sizes is the long
time overhead and the resulting wasted energy in current radios as they
transition from shutdown to the active state.

Another observation is that the digital and analog electronic processing is
usually more complex at the receiver, which has the harder task of decoding
and, therefore, uses more complex signal processing functions (e.g.,
equalizers) than at the transmitter. Therefore, for short-range radios it is
possible that the total energy spent in receiving a bit is more than the energy
spent in transmitting it. Such is the case in Figure 11.2, which shows power
measurements of a prototype wireless sensor device from UCLA in Rx mode
and in Tx mode at different power levels [15]. This situation is the complete
reverse of the usual assumption that transmit power dominates and renders
inefficient conventional strategies that minimize the time a radio spends in
the transmit mode at the expense of an increase in receive time.

11.3 POWER REDUCTION AND MANAGEMENT FOR WIRELESS COMMUNICATIONS

The discussion in the previous section reveals that the sources of power
consumption in communications are quite diverse and different from sources
such as capacitive charging/discharging, leakage current, etc. that are the
basis for power reduction and management techniques developed for
processors, ASICs, etc. While some of these techniques can certainly be
used to address the electronic power consumption during communication,
much of the power consumption during communication lies beyond their
reach.
This has led to the recognition that one needs power reduction and
management techniques for wireless communications that specifically target
(i) new sources of power consumption, such as RF power, (ii) new
opportunities for power-performance trade-off, such as choices of
modulation and protocols, and (iii) new problems, such as how to wake up a
sleeping radio when the wake-up event is at a remote node.
The remainder of this chapter describes a selection of such techniques
that have been developed. These techniques seek to make communication
more power-efficient by reducing the number of raw bits sent across per
useful information bit that needs to be communicated, or by reducing the
amount of power needed to transmit a raw bit, or by a combination of the
two. The goal of many of the techniques presented is not mere power
reduction but rather power awareness whereby power consumption is
dynamically traded-off against other metrics of system performance such as
throughput, network coverage, accuracy of results, etc. This is done by
intelligent adaptation of control knobs offered by the system components
such as the radio or a protocol.
In the case of digital and analog processing, the various power-reduction
and management techniques have been classified according to whether they
are technology-level techniques (e.g., lowering threshold voltages), circuit-
level techniques (e.g., low supply voltage), architecture-level techniques
(e.g., shutdown or dynamic voltage scaling by an operating system), or
algorithm-level techniques (e.g., power-efficient signal processing
algorithms). Classifying according to technology, circuit, architecture, and
algorithm levels is not appropriate in the case of communications, and a
better way is to classify according to the layer of the communication
protocol stack that a technique impacts. While the seven-layer OSI protocol
stack is the standard for networked systems, the various techniques presented
in this chapter are classified into two broad classes: lower-layer (physical,
link, MAC) and higher-layer (routing, transport, application) techniques.
Historically, the purpose of layering has been to hide information across
layers to allow for modular implementation of communication systems.
However, as will be seen repeatedly in this chapter, the various power-
reduction and management techniques for wireless communications rely
extensively on knowing information about the state of the other layers in the
system, and big gains come from optimizing across the layers and exploiting
the coupling between the protocol layers [16].

11.4 LOWER LAYER TECHNIQUES

11.4.1 Dynamic Power Management of Radios

Analogous to shutting down all or parts of a digital computing system to
save energy [6], shutting down the radio when not in use is an obvious
strategy for saving energy. As is now well-understood, in the case of digital
circuits a strategy of reducing voltage in order to run slowly and consume
the entire time budget is better than running at full speed and then shutting
down in the remaining idle time. This is because in CMOS circuits,
performing an operation slower requires reduced energy. This has been
effectively exploited by various schemes proposed in recent years. For
example, in processors the operating systems perform energy-aware task
scheduling by dynamically varying the voltage to minimize energy
consumption while meeting throughput requirements or deadlines.
Recent research has shown that there exists a powerful class of dynamic
power management techniques for radios that exploit control knobs
providing similar energy-speed relationships as in CMOS circuits. In other
words, “slowdown is better than shutdown” is often true for radios as well.
Two such readily accessible control knobs in radios are the modulation
level and the error-coding rate, which allow energy and data rate to be
traded off for a given bit error rate. We call these knobs Dynamic
Modulation Scaling (DMS) [17] and Dynamic Code Scaling (DCS), respectively.
To provide energy awareness, these control knobs need to be integrated
into a power management policy much as DVS in circuits is exploited by
power management policies incorporated into, for example, the operating
system task schedulers. In communications, the analogue of task scheduling
is packet scheduling over the wireless link. Just as researchers in the past
have created energy-aware versions of task scheduling algorithms for
operating systems, recent research has led to energy-aware versions of
packet scheduling algorithms [18]. Despite the analogies between DVS and
radio-level techniques such as DMS and DCS, there are important
differences. For example, the modulation setting cannot be changed midway
through a packet, and the packet itself has to be transmitted
non-preemptively. Also, the wireless channel may vary over time. These
variations
have to be taken into account in the energy-aware packet-scheduler.
This subsection first describes DMS and DCS, the two radio-level control
knobs that allow energy reduction by communicating more slowly. Next,
currently available energy-aware packet scheduling schemes that exploit
these radio control knobs for the dynamic power management of radios are
described.

11.4.1.1 The Energy-speed Control Knobs

Dynamic modulation scaling


To understand DMS one needs to analyze the detailed relationship between
the energy and the modulation level b (the number of bits per symbol). The
scheme that is probably most amenable to scaling, due to its ease of
implementation and analysis, is Quadrature Amplitude Modulation (QAM) [11].
The resulting Bit Error Rate (BER) is well approximated by equation (11.4),
in which E_t is the transmitted energy per symbol, E_n is the noise energy
per symbol, and the factor A contains all transmission loss components. The
function Q(x) is defined in equation (11.5).

BER ≈ (4/b) · Q( sqrt( 3/(2^b - 1) · A·E_t/E_n ) )   (11.4)

Q(x) = (1/sqrt(2π)) ∫_x^∞ e^(-t²/2) dt   (11.5)

By solving for the transmit power, equation (11.6) is obtained, where the
parameter C_RF is defined as in equation (11.7) and the inefficiency of the
amplifier is ignored:

P_RF = C_RF · f(b) · R_s   (11.6)

C_RF = (1/3) · (E_n / A) · [ Q^(-1)( b·BER/4 ) ]²   (11.7)

where f(b) = 2^b - 1, R_s is the symbol rate, and E_n is a function of the
receiver implementation and the operating temperature. A depends on
distance and the propagation environment and can vary with time. Neither of
them varies with b. Due to the Q^(-1) function, C_RF is only very weakly
dependent on b. The main benefits from modulation scaling are due to f(b).
The power consumption of the electronic circuitry, which is largely analog,
can be written as in equation (11.8), where C_elec is a constant:

P_elec = C_elec · R_s   (11.8)

With equations (11.6) and (11.8) and the raw bit rate R = b·R_s, the
expression in equation (11.2) for the total energy spent in communicating a
raw bit becomes an explicit function of the modulation level:

E_txBit = ( C_RF · f(b) + C_elec ) / b   (11.9)

Together equations (11.1) and (11.9) give the trade-off between the
energy and the delay in sending a bit. A similar trade-off exists for other
modulation schemes, such as for Phase Shift Keying (PSK) and Pulse
Amplitude Modulation (PAM), with appropriate definitions of f(b) and the
associated constants. In general, DMS is applicable to other scalable
modulation schemes as well.
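The trade-off can be sketched numerically. The snippet below assumes the energy and delay per bit take the forms (C_RF·(2^b - 1) + C_elec)/b and 1/(b·R_s) discussed above; the constants are purely illustrative assumptions, not measured values:

```python
# Energy and delay per bit under QAM modulation scaling; C_RF and C_ELEC
# are assumed illustrative constants (J per symbol), R_S the symbol rate.

C_RF = 1e-9     # J/symbol scale of the RF term (assumed)
C_ELEC = 1e-7   # J/symbol for the electronics (assumed)
R_S = 1e6       # symbols/s (assumed)

def energy_per_bit(b):
    # f(b) = 2**b - 1 grows with the modulation level
    return (C_RF * (2**b - 1) + C_ELEC) / b

def delay_per_bit(b):
    return 1.0 / (b * R_S)

# Sweep the even QAM constellations; the energy-minimal b sits where the
# growing RF term overtakes the shrinking per-bit share of the electronics.
levels = [2, 4, 6, 8, 10]
best = min(levels, key=energy_per_bit)
for b in levels:
    print(f"b={b:2d}  E={energy_per_bit(b):.3e} J/bit  T={delay_per_bit(b):.1e} s/bit")
print("energy-minimal modulation level:", best)
```

Under these assumed constants the minimum lands at b = 6; scaling below that point trades energy against delay, which is the operating region of DMS described next.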
Although the discussion so far has assumed that the modulation level can
be varied continuously, in reality the analysis presented is valid only for
integer values of b. In the case of QAM, the expressions are exact only for
even integers but are reasonable approximations when b is odd [11]. One can
even define a fractional modulation level whereby different parts of a packet
are sent with different modulation levels so that the packet as a whole has an
effective fractional modulation level obtained by linear interpolation.
One should also note that it is impractical to change the modulation level at
arbitrary time instants since both sender and receiver need to know the exact
modulation scheme that is used. A change in modulation level at the sender
can either be negotiated with the receiver via a protocol handshake or can be
described in a well-known packet header field. The need to coordinate the
change in modulation level between the sender and the receiver is a
crucial difference between DMS and DVS and affects the power-
management policy as well.
Figure 11.3 illustrates the energy-delay trade-off for QAM for a
representative set of parameter values. The curve labelled
“Ideal” corresponds to equations (11.1) and (11.9). The circles indicate
constellations that can be realized, and the solid line gives the values that are
obtained via the interpolation to fractional modulation levels. For QAM, the
minimum modulation level is equal to 2. It is clear that equations (11.1) and
(11.9) are very good approximations to what is practically realizable.
Finally, one needs to address the question as to when DMS is effective.
Figure 11.4 shows the same energy-delay trade-off for QAM for different
relative weights of the RF and electronic terms. Note that the transmit
power varies with b; the values on the graph are taken at b = 2 and
correspond to the transmit power varying from 2.25 mW to 144 mW. DMS
exploits the effect that, by varying the modulation level, energy can be
traded
off versus delay, as explained above. On the left side of the curves in Figure
11.4, lowering b reduces the energy, at the cost of an increased delay.
Scaling to the right of the point of minimum energy clearly does not make
sense, as both energy and delay would increase. The operating region of
DMS, therefore, corresponds to the portion of the curves to the left of their
energy minimum points. In this region, modulation scaling is superior to
radio shutdown, even when the overhead associated with waking up a
sleeping radio is ignored, because with shutdown the total energy per bit
will not change. The energy-delay curves are convex so that a uniform
stretching of the transmissions is most energy efficient, similar to what has
been observed for DVS [10] where the energy vs. speed curve is convex as
well.
From Figure 11.4, note also that DMS is more useful in situations where
the RF term is large, or in other words, where the transmit power dominates the
electronics power. This is true except for wireless communication systems
with a very short range.
Dynamic code scaling
Another radio-level control knob is DCS, the scaling of the forward error
correcting code that is used. Coding introduces extra traffic that is
characterized by the rate of the code, which is the ratio of the size of the
original data to the size of the coded data. Using a radio with a given symbol
rate, the use of a lower-rate code results in larger time but lower energy that
is needed to get the same number of information bits across at a specified bit
error rate. For example, consider a wireless channel with additive white
Gaussian noise, average signal power constraint P, and noise power N. As
shown in [19], under optimal coding, the RF energy spent to reliably
transmit each information bit is proportional to where s
the number of raw symbol transmissions needed to send each information
bit, and is a function of the BER and the modulation scheme that
gives the ratio between the number of information bits that are reliably
transmitted per symbol to the channel capacity in bits/symbol.
decreases monotonically with s, or, equivalently, the
energy taken to transmit each information bit decreases monotonically as the
time allowed to transmit that bit is increased. Indeed, as [19] mentions, for
practical values of SNR for a wireless link, there is a 20x dynamic range
over which the energy per information bit can be varied by changing the
transmission time. Similar energy-delay behavior is observed for real-life
sub-optimal codes as well. Figure11. 5 shows the energy vs. delay behavior
of a real-life multi-rate convolutional code from [20].
Lastly, note that the energy-delay curves due to DCS are also convex
(besides being monotonically decreasing) just as is the case for DVS and
DMS, so that a uniform stretching of the transmissions is most energy
efficient as observed in [10].

11.4.1.2 Exploiting the Radio-level Energy-speed Control Knobs in Energy-aware Packet Scheduling

Just as DVS has driven the dynamic power management of digital
computation systems beyond shutdown-based approaches, DMS and DCS
pave the way towards dynamic power management of radios that goes
beyond shutdown. In particular, analogous to various OS-level task-
scheduling approaches that make use of DVS, one can develop new energy-
aware packet scheduling policies for the link layer in the wireless protocol
stack. The body of literature dealing with packet scheduling is vast, and, in
principle, can be extended towards energy-aware versions using DMS and
DCS. However, many challenges lie ahead, since radio power management
must deal with both traffic load and channel variations. The remainder of
this subsection describes two examples of energy-aware scheduling
approaches that exploit DMS and illustrate some of the challenges. They
each highlight one of two different issues, namely the presence of deadlines
and channel variations.
Energy-aware real-time packet scheduling in time-invariant wireless channels
Consider a scenario with multiple packet streams being sent by a wireless
device, with the packets in each stream being generated periodically and
needing to reach their destination receiver by a deadline. The destinations
are one hop away, and the periods and deadlines may be different for the
different streams. The length of the packets within a stream may vary, but
there is a maximum packet size known for each stream. Such a scenario
might occur with a wireless sensor node with multiple transducers sending
data to receivers or for a wireless multimedia device sending audio, video,
and other media streams to a basestation. To keep the focus on real-time
constraints, the wireless channel itself is assumed to be stationary with no
time varying impairments such as fading.
The question that needs to be answered is: how can radio control knobs
such as DMS or DCS be exploited to minimize the power while meeting the
real-time constraints? The problem is similar to scheduling tasks in a real-
time operating system (RTOS) running on a CPU with DVS [8]. However,
there is one key difference: in most communication systems the packet
transmission cannot be suspended and resumed later. In other words, the
packet scheduling is non-preemptive.
One approach to this problem is based on using a non-preemptive earliest
deadline first (EDF) scheduler together with DMS. The schedulability
conditions for a non-preemptive EDF scheduler are available in the
literature [21], and a practical heuristic approach [18] to the problem is
described below:

1. Admission step: When a new stream is admitted to the system, a static
scaling factor is calculated assuming all packets are of maximum size.
This factor is the minimum possible such that, if the modulation setting
for each packet were scaled by it, the schedulability test would still be
satisfied. In other words, it computes the slowest transmission speed at
which all the packet streams are schedulable.
2. Adjustment step: At run-time, packets are scheduled using EDF. Before
transmission starts, the actual size of each packet is known. An
additional scaling factor is calculated such that the transmission
finishes when that of a maximum-size packet would have. Since step 1
assumed the maximum packet size, schedulability is guaranteed. If the
system would still be idle after the packet transmission, the
transmission is stretched until the packet's deadline or the arrival time
of a new packet, yielding a third, stretching scaling factor.
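The two steps above can be sketched as follows. This is a simplified illustration, not the scheme of [18]: the admission step uses the plain utilization bound U ≤ 1 in place of the full non-preemptive EDF schedulability test of [21], and all stream parameters are assumptions:

```python
# A sketch of the two-step scaling heuristic for energy-aware EDF packet
# scheduling. The admission step uses the simple utilization bound U <= 1
# as a stand-in for the non-preemptive EDF test of [21].

BASE_RATE = 1e6  # bits/s at the fastest modulation setting (assumed)

streams = [  # (period in s, max packet size in bits) — assumed workload
    (0.010, 4000),
    (0.020, 8000),
    (0.050, 8000),
]

# Admission step: slowest uniform speed at which max-size packets still fit.
utilization = sum(size / BASE_RATE / period for period, size in streams)
static_scale = max(utilization, 1e-9)  # fraction of full speed needed
assert static_scale <= 1.0, "stream set not schedulable"

def adjusted_rate(actual_size, max_size):
    """Adjustment step: stretch a short packet so its transmission takes
    as long as a max-size packet would at the statically scaled rate."""
    return BASE_RATE * static_scale * (actual_size / max_size)

print(f"static scale: {static_scale:.2f}")
print(f"rate for full packet:  {BASE_RATE*static_scale:,.0f} bit/s")
print(f"rate for 1000-bit pkt: {adjusted_rate(1000, 4000):,.0f} bit/s")
```

Here the static scale comes out at 0.96, so every packet is sent at 96% of full speed at most, and a 1000-bit packet from the first stream is slowed a further 4x so it finishes exactly when a 4000-bit packet would have.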

The scheduler combines all three scaling factors to get the overall
modulation that is used for the current packet. To see the benefit of this
approach, consider a simulation scenario with the basic parameters the
same as those used previously in Figure 11.3.
The packet sizes are uniformly distributed
between the maximum packet size of 1000 bytes and a minimum value. A
special field in the packet header, encoded with 4-QAM, is used to
communicate the modulation level for the rest of the packet to the receiver.
The possible modulation levels span a fixed range in fractional steps and
are coded in 4 bits in the packet header. Figure 11.6 plots the energy
consumption when using energy-aware packet scheduling, normalized to a
scheme without scaling (a fixed b at all times). The different plots
correspond to sample scenarios with multiple periodic streams with
different total link utilizations U. For U = 0.82, the figure separates the
contributions of the different scaling factors. When only the static
scaling factor is used, the transmissions are slowed down uniformly without
exploiting the run-time packet length variations. These variations are
leveraged by the adjustment factor: the energy decreases as the packet size
variation increases. The effect of the final stretching factor is marginal
in this example.

The power-management scheme described above essentially exploits
traffic load variations on two levels to introduce energy awareness.
1. Variations in overall utilization are handled by the admission step
through the static scaling factor. These are due to changes in the number
of streams, which are likely to occur over relatively large time scales.
2. Variations in individual packet sizes, on the other hand, occur at much
smaller time scales. These cannot be handled during admission, but are
exploited in the adjustment step through the adjustment and stretching
factors.
Energy-aware non-real-time packet scheduling in time-variant wireless channels
Another issue in radio power management is the effect of time variations in
the wireless channel. This has no direct equivalent in DVS-based CPU task
scheduling. To introduce some of the challenges, consider a rudimentary
scenario: the transmission of a single data stream that has no hard deadline
associated with it but only an average data rate constraint. This model is
useful in the case of a file transfer, for example.
As discussed previously, the parameter A in equation (11.7) captures the
effect of the wireless channel. In the presence of time variations, this
factor is split up into two components as in equation (11.10), where A_avg
represents the average value and the gain factor a(t) contains the
normalized time variations:

A = A_avg · a(t)   (11.10)

The behavior of the gain factor a(t) can be characterized by two
statistics: a probability density function and a Doppler rate, which
describes the time correlation [11].

To cope with channel variations, an estimate of the current channel
condition is needed. This is obtained through channel estimation, which is
updated regularly. The update rate is chosen such that the channel
remains approximately constant between updates, yet the overhead of the
estimation process is limited. In the previous subsection, the dynamic power
management of radios using DMS turned into a scheduling problem because
of the interaction between multiple streams. Here, only one stream is
considered, but the presence of a time-varying channel makes the choice of
the best value of b again a scheduling issue. The decision depends on how
good or bad the channel will be in future, i.e., whether it is more energy-
efficient to send now or later. If the average throughput is the only additional
constraint, the problem can be greatly simplified. As shown in [22], there
exist thresholds that directly link the current channel condition to the
optimal choice of b. Equation (11.11), a generalization of the results in
[22], maps the estimated channel condition to the optimal modulation level
through such a set of thresholds and thereby yields a DMS-based
energy-aware packet-scheduling policy for this condition.

There is only one independent parameter left, which can be solved for from
the constraint on the desired average data rate, expressed as an average
number of bits per symbol [22]. Thus, the thresholds only depend on the
statistics of the wireless channel, which can be estimated online. One no
longer has to know the exact behavior of the channel over time to achieve
the energy-optimal scheduling policy.
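A simplified stand-in for such a threshold policy can be sketched with a single Lagrange multiplier that is bisected until the average-rate constraint is met; the channel model, cost constants, and rate target below are all assumptions for illustration, not the policy of [22]:

```python
import random

# "Loading in time", a simplified sketch: per slot, pick the modulation
# level b (or shutdown, b = 0) minimizing energy - lam*b, and tune the
# multiplier lam so the average-rate constraint is met. Channel gains and
# cost constants are assumed illustrative.

C_RF, C_ELEC = 1e-9, 5e-10          # J/symbol (assumed)
LEVELS = [0, 2, 4, 6, 8]            # 0 = radio shut down this slot

def slot_energy(b, gain):
    if b == 0:
        return 0.0
    return C_RF * (2**b - 1) / gain + C_ELEC  # worse gain -> more RF energy

def schedule(gains, lam):
    """Pick per-slot b minimizing energy - lam * bits."""
    return [min(LEVELS, key=lambda b: slot_energy(b, g) - lam * b)
            for g in gains]

random.seed(0)
gains = [random.uniform(0.1, 1.0) for _ in range(2000)]  # crude fading model

# Bisect lam until the schedule delivers ~4 bits/symbol on average.
lo, hi = 0.0, 1e-6
for _ in range(60):
    lam = (lo + hi) / 2
    plan = schedule(gains, lam)
    if sum(plan) / len(plan) < 4.0:
        lo = lam
    else:
        hi = lam

energy = sum(slot_energy(b, g) for b, g in zip(plan, gains))
const = sum(slot_energy(4, g) for g in gains)  # constant b = 4 baseline
print(f"avg bits/symbol: {sum(plan)/len(plan):.2f}")
print(f"channel-aware energy / constant-b energy: {energy/const:.2f}")
```

As in curve 1 of Figure 11.7, the policy shuts the radio down in the worst slots and loads the best ones more heavily, ending up cheaper than uniformly using b = 4 at the same average rate.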
Figure 11.7 shows the simulated performance of this radio power
management scheme for different values of the average throughput
constraint. The basic parameters, including the possible modulation
levels, are the same as in the real-time energy-aware packet-scheduling
scheme of the preceding subsection. The time correlation of the channel is
characterized by a Doppler rate of 50 Hz, an update rate of 1 kHz was
selected for the channel estimation, and the maximum possible transmit
power is 1 W.
Curve 1 in Figure 11.7 plots the behavior of the “loading in time”
scheduling policy described here. It is superior to scaling with “constant b”
(curve 2), where the modulation is uniformly slowed down based on the
average throughput, but channel variations are not taken into account. The
difference between curve 2 and curve 3, which shows the same uniform
scaling in a non-time-varying channel, illustrates the performance
degradation associated with channel variations. Below the minimum
modulation level, one resorts to shutdown, and both these curves flatten
out, which is as expected from the earlier discussion on DMS. However,
curve 1 keeps on decreasing as the throughput constraint is lowered and can
even outperform scaling in a non-time-varying channel (curve 3). The reason
is that one still uses shutdown, but only the very best time intervals are
selected to carry information. For curve 2, the shutdown was periodic,
without taking the
channel state into account. Finally, curve 4 corresponds to a scheme that is
not energy-aware but tries to achieve the “maximum throughput” possible.
In this case, b is adapted to yield its maximum value without violating the
maximum transmit power. As this is only based on the current channel
condition, scheduling issues never arise. The benefits of energy awareness,
where a reduced throughput requirement is leveraged to yield energy
savings, are substantial.

11.4.2 More Lower-layer Energy-speed Control Knobs

In addition to DMS and DCS there are other radio-level control knobs that
one can exploit for power management. In fact, the interaction between
performance and energy at the radio level is much more complex than for
CPUs, with many more run-time variables. The raw radio-level performance
is a function of three variables: the BER, the RF transmit power, and the raw
data rate. The modulation scheme and bit-level coding choices both decide
where the radio operates in this three-dimensional space. In DCS and DMS
the BER is kept constant and the other two variables are traded off. One can
certainly imagine more sophisticated control knobs that trade-off among all
the three variables simultaneously and are under the control of an energy-
aware scheduler, although no such scheme has yet been reported in the
literature.
The situation, however, is even more complex because rarely is an
application interested in low-level data throughput or BER. Rather, the radio
is separated from the applications by layers of protocols that execute
functions such as packetizing the application data as payloads of packets
with headers, performing packet-level error control such as retransmission of
lost and corrupted packets, and packet-level forward error coding. The real
measure of performance is the rate at which application-level data is reliably
getting across. This is often called the goodput, which is a function of not
only the raw data rate and the BER, but also the nature of the intervening
protocols and the packet structure they impose. If one were to trade energy
for goodput, many other control knobs become available which depend on
the protocols used. One such control knob, described below, is the
adaptation of the length of the frames in which the application data is sent.
Another knob is the adaptation of packet-level error control [23].

11.4.2.1 Frame Length Adaptation

In order to send data bits over a wireless link, the bits are grouped into link-
layer frames (often called MAC frames) and scheduled for transmission by
the MAC mechanism. Typically, higher-layer packets, such as IP datagrams,
are fragmented to fit into these link-layer frames and reassembled at the
receiver. However, when the underlying channel is variable, operating with a
fixed frame length is inefficient, and it is better to adapt it to the momentary
channel condition instead [24].
Each frame has a cyclic redundancy check (CRC) to determine whether it
contains errors. Although adaptive frame-level forward error correction
could be treated in conjunction with frame length adaptation, we restrict
ourselves to the simpler frame-level error detection case here. Since there is
no correction capability, a single bit error leads to the entire frame
being dropped. Therefore, smaller frames have a higher chance of making it
through. Each frame, however, contains a fixed header overhead, such that in
relative terms this overhead increases with decreasing frame length. The
length of the frame’s payload and header field are denoted by L and H,
respectively. For a point-to-point communication link, the crucial metric of
performance is the “goodput,” which is the actual data rate G offered to the
higher layers [23]. It takes into account the fact that header overhead
and erroneous transmissions do not contribute useful data; in the presence
of uncorrelated bit errors with bit error rate p and raw data rate R, it is
given by equation (11.12):

G = R · L/(L + H) · (1 - p)^(L + H)   (11.12)

For a given transmit RF power, the energy per good application level bit
would be proportional to the inverse of the goodput expression above.
Therefore, it is more energy efficient if the frame length L is selected so that
the goodput G is maximized for a given radio and channel condition. The
data field size that maximizes the goodput, and therefore minimizes the
energy spent per good application-level bit, is given by equation (11.13),
where p is the bit error rate:

L_opt = (H/2) · ( sqrt( 1 - 4/(H·ln(1 - p)) ) - 1 )   (11.13)
When the BER varies slowly, i.e., over a timescale sufficiently larger
than the frame transmission time, these expressions correspond to the
optimal values at each moment in time. By estimating the BER over time,
the frame length setting can track the channel variations by adapting it
according to equation (11.13).
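Assuming the goodput takes the form G = R·L/(L+H)·(1-p)^(L+H) as in equation (11.12), its maximizer (equation (11.13)) can be computed directly; the 64-bit header below is a hypothetical value for illustration:

```python
import math

# Optimal frame payload for a point-to-point link: goodput
# G(L) = R * L/(L+H) * (1-p)**(L+H) for bit error rate p and header
# length H is maximized at L* = (H/2)*(sqrt(1 - 4/(H*log(1-p))) - 1).

def goodput(L, H, p, R=1.0):
    return R * L / (L + H) * (1.0 - p) ** (L + H)

def optimal_payload(H, p):
    return 0.5 * H * (math.sqrt(1.0 - 4.0 / (H * math.log(1.0 - p))) - 1.0)

H = 64  # header bits (hypothetical)
for p in (1e-5, 1e-4, 1e-3):
    L = optimal_payload(H, p)
    print(f"BER={p:.0e}  optimal payload ~ {L:,.0f} bits")
```

As expected, the optimal payload shrinks sharply as the channel worsens, which is exactly what frame length adaptation tracks at run time.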
A straightforward approach to frame length adaptation would be to
directly estimate the BER at regular intervals via bit error measurement, and
set L accordingly. In order to obtain an accurate estimate, the BER has
to be averaged over a large time window, which severely limits the
responsiveness of the adaptation. Therefore, it is better to use lower-level
channel parameters, measured by the radio, that indicate the quality of the
channel and can be used to estimate BER and thus the appropriate frame
length.
More results on frame length adaptation can be found in [24].
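To make the adaptation concrete, the goodput and optimal payload expressions can be evaluated numerically. The sketch below is illustrative: the header size and the BER values are assumptions rather than figures from the text, and the goodput is expressed as a fraction of the raw channel rate.

```python
import math

def goodput_fraction(L, H, ber):
    """Fraction of the raw channel rate delivered as good payload bits:
    G/R = L/(L+H) * (1 - ber)^(L+H), for uncorrelated bit errors."""
    n = L + H
    return (L / n) * (1 - ber) ** n

def optimal_payload(H, ber):
    """Payload size maximizing the goodput:
    L* = -H/2 + sqrt(H^2/4 - H/ln(1 - ber))."""
    return -H / 2 + math.sqrt(H * H / 4 - H / math.log(1 - ber))

H = 64  # assumed header size in bits (illustrative)
for ber in (1e-5, 1e-4, 1e-3):
    L_opt = optimal_payload(H, ber)
    g = goodput_fraction(L_opt, H, ber)
    print(f"BER={ber:.0e}: optimal payload ~ {L_opt:6.0f} bits, "
          f"goodput fraction {g:.3f}")
```

As expected, a noisier channel calls for shorter frames, and perturbing the payload size away from the optimum in either direction lowers the goodput.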

11.4.3 Energy-aware Medium Access Control

Another lower-layer factor that influences the energy efficiency of wireless
communications is the medium access control (MAC) protocol that helps
arbitrate access by multiple transmitters to the shared wireless channel.
MAC protocols accomplish channel sharing in one of two ways. In the
first category are MAC protocols that rely on random channel access by a
transmitter, often under the control of a probabilistic collision avoidance
scheme. If a collision does happen, back-off and retransmission are used. A
good example of such a protocol is the MAC protocol in the widely used
802.11 wireless LAN standard. A feature of random access MAC protocols
is that the receiver is always listening for a transmitter to send data, thus
implicitly assuming that idle listening has no detrimental effect. In the
second category are MAC protocols based on time-division multiplexing
(TDM) whereby the participating nodes are time-synchronized and see time
as being divided into slots. Typically, either a fixed periodic schedule is
used, or a dynamic schedule that is decided by a centralized entity, such as
a basestation, and updated every frame of N slots. The relative rigidity of these TDM-
based MAC protocols, the overhead of higher protocol related signaling
traffic, and the difficulty of performing the slot scheduling in a distributed
fashion have led to their use being limited to scenarios such as voice cellular
networks, where the architecture is centralized and the ability of TDM-based
MAC protocols to provide guaranteed access time is essential.
Since the MAC protocols directly control which mode (transmit, receive,
idle listening, and sleeping) the various radios in the network are in at any
instant of time, they can have a significant impact on energy consumption.
Moreover, the power consumption in the various modes depends on factors
such as radio range and data rate, and therefore no single MAC protocol is
likely to be the most energy efficient across the board. Rather, different
MAC protocols are likely to be useful in different scenarios such as wide
area cellular systems, wireless LANs with access points, ad hoc wireless
LANs with no access points, and ad hoc networks of short range wireless
devices. Therefore, it is perhaps more important to understand the
desirable attributes of an energy-efficient MAC and use them to design or
select the MAC protocol for a specific application.
Following are the attributes of energy-efficient MAC protocols.

Reduce time radio is in transmit mode. This requires minimizing
random access collisions and consequent retransmissions. Techniques
such as polling and reservation of slots are often used.
Reduce time radio is in receive mode. This requires minimizing
time spent in idle listening for packets to arrive. Techniques such as
broadcasting a periodic “beacon” telling the receivers when to wake
up [25] or using a separate low power paging or wake-up radio can
help.
Reduce transmit-receive and on-off turn-around time. This
requires maximizing contiguous transmission slots from a radio.
Allow mobiles to voluntarily enter into sleep mode. This requires
that senders buffer the frames destined for a node while it is
sleeping, and some mechanism whereby the sleeping
nodes can learn that there are new packets for them. Again, a wake-
up or paging radio can help.
Reduce MAC protocol related signaling traffic. Protocols that
require control packets to be exchanged for tasks such as channel
acquisition and packet reception acknowledgment have to pay an
overhead for packet transmission upfront.

Practical MAC protocols involve a trade-off among these attributes, and
many researchers have developed energy-efficient MAC protocols that make
different choices suitable for different system scenarios [25] [26] [27] [28].

11.5 HIGHER LAYER TECHNIQUES

Techniques for energy efficiency at the lower layers of wireless
communications are able to deal only with nodes that share a common
channel. This suffices for scenarios such as point-to-point links or for
cellular systems where nodes communicate to an access point or basestation.
However, many emerging applications involve large-scale ad hoc networks
where two nodes that are many hops apart need to communicate. A good
example is an ad hoc network of wireless sensors [5]. In such cases, there is
potential for much more impact on energy consumption if one were to
consider the network as a whole and consider end-to-end communications
through the network instead of individual nodes or links. This falls within
the realm of higher-layer protocols and even the application. Higher-layer
techniques for energy efficiency seek to make the network as a whole
energy-aware.

11.5.1 Network Topology Management

One important class of network-level techniques are those that are based on
the idea that not all nodes in an ad hoc network need to have their radios
active all the time for multi-hop packet forwarding. Many nodes can have
their radios put to sleep, or shutdown, without hurting the overall
communication functioning of the network. Since node shutdown impacts
the topology of the network, these techniques are also called topology
management approaches. Shutdown is an important approach because the
only way to save power consumption in the communication subsystem is to
completely turn off the node’s radio, since the idle mode is as power hungry
as the receive mode and, in the case of short-range radios, as power hungry
as (or even more than) the transmit mode. However, as soon as a node powers
down its radio, it is essentially disconnected from the rest of the network
topology and, therefore, can no longer perform packet relaying.
For simplicity, here we refer to a node being shutdown or asleep,
although we really mean that its radio is being turned off. The rest of the
node may or may not also be active depending on what the node does. A
node whose sole purpose is to act as a communication entity can probably be
shutdown in entirety. On the other hand, a node in a wireless sensor network
might still have its sensor-processing unit active to detect local events.
Effective topology management via node shutdown requires approaches
that take a global view of the entire network. Conventional shutdown-based
power management approaches for embedded systems [6] [2] address
shutdown within individual nodes and cannot coordinate the node shutdown
decisions to maintain network communication functionality at the desired
level. Moreover, in most useful situations one needs to avoid supposedly
optimum centralized approaches for coordinating the shutdown of spatially
distributed nodes. The large communication energy cost associated with a
central coordination approach would overwhelm the energy savings that one
expects from shutdown to begin with. Therefore, intuitively, it would be
desirable to have network-level power management approaches that perform
the shutdown of spatially-distributed nodes via algorithms that are
distributed and operate on the basis of local information, i.e., algorithms
that are both localized and distributed.
The goal of topology management is to coordinate the sleep transitions of
all the nodes, while ensuring that data can be forwarded efficiently, when
desired, to the data sink. Recent research has yielded two broad categories of
approaches, and their hybrid. In the first category are approaches such as
GAF [13] and SPAN [29] that leverage the fact that nearby nodes in a dense
network can be equivalent for traffic forwarding, and therefore, redundant
nodes can be shutdown while maintaining the capacity or connectivity of the
network at all times. Essentially, these approaches trade energy with density.
In the second category are approaches such as STEM [30] that rely on the
observation that in many applications it is wasteful to maintain the capacity
or connectivity of the network at all times. Such is the case, for example, in a
network of sensors and actuators where the network is in a monitoring state
much of the time and gets activated only when an event of interest takes
place. These techniques aggressively put nodes to sleep and provide a
mechanism to wake the nodes up along the communication path when they
are needed to forward data. Thus, energy is traded with communication set-
up latency. Finally, there are hybrid approaches that combine the two ideas
to trade energy with both density and set-up latency. In the remainder of this
subsection we describe each of these three types of approaches.

11.5.1.1 Topology Management via Energy vs. Density Trade-off

Recently, several schemes that seek to trade excess node density in ad hoc
networks for energy have appeared in the literature. Two of the first ones
were GAF [13] and SPAN [29]. These techniques operate under the
assumption that a constant network capacity needs to be maintained at all
times and try to do so by shutting redundant nodes down. No use is made of
the knowledge of the overall state of the networked application. So, for
example, whether a network of wireless sensors is monitoring or actively
communicating data, these techniques try to provide the same capacity.
With SPAN a limited set of nodes forms a multi-hop forwarding
backbone that tries to preserve the original capacity of the underlying ad-hoc
network. Other nodes transition to sleep states more frequently as they no
longer carry the burden of forwarding data of other nodes. To balance out
energy consumption, the backbone functionality is rotated between nodes,
and as such, there is a strong interaction with the routing layer.
Geographic Adaptive Fidelity (GAF) exploits the fact that nearby nodes
can perfectly and transparently replace each other in the routing topology.
The sensor network is subdivided into small grids, such that nodes in the
same grid are equivalent from a routing perspective. At each point in time,
only one node in each grid is active, while the others are in the energy-
saving sleep mode. Substantial energy gains are, however, only achieved in
sufficiently dense networks.

To illustrate this class of topology management schemes further, this
section delves deeper into the behavior of GAF. An analysis of the energy
benefits of GAF is presented here. This analysis, while not performed by the
GAF authors themselves [13], helps one fully understand the energy-density
trade-off. The GAF algorithm is based on a division of the sensor network
into a number of virtual grids of size r by r, as shown in Figure 11.8. The
value of r is chosen such that all nodes in a grid are equivalent from a
routing perspective. This means that any two nodes in adjacent grids should
be able to communicate with each other directly. For the worst-case node
locations depicted in Figure 11.8, one can calculate that r should satisfy
r·√5 ≤ R, where R is the transmission range of a node. The average number of
nodes in a grid is M = N r^2/L^2, where N is the total number of nodes in a
field of size L x L. The average number of neighbors of a node will be
μ = N π R^2/L^2, so that one gets M = μ r^2/(π R^2). From now on, assume that
r = R/√5 and, therefore, M = μ/(5π).
Since all nodes in a grid are equivalent from a routing perspective, this
redundancy can be used to increase the network lifetime. GAF only keeps
one node awake in each grid, while the other nodes turn their radio off. To
balance out the energy consumption, the burden of traffic forwarding is
rotated between nodes. For analysis, one can ignore the unavoidable time
overlap of this process associated with the handoff. If there are m nodes in a
grid, each node will (ideally) only turn its radio on 1/m of the time and,
therefore, will last m times longer. When distributing nodes over the sensor
field, some grids will not contain any nodes at all. Let β be the fraction of
used grids, i.e., those that have at least one node. As a result, the average
number of nodes in the used grids is M’ = M/β.

The average power consumption of a node using GAF, P_GAF, is shown
in equation (11.14). In this equation, P_node is the power consumption of a
node if GAF is not used; it thus contains contributions of the receive, idle,
and transmit modes, as such a node would never turn its radio off. With
GAF, in each grid only one node at a time has its radio turned on, so the
total power consumption of a grid, P_grid, is almost equal to P_node
(neglecting the sleep power of the nodes that have their radio turned off).
Since M’ nodes share the duties in a grid equally, the power
consumption of a node is 1/M’ that of the grid, as in equation (11.14):

P_GAF = P_grid / M’ ≈ P_node / M’     (11.14)

The average relative energy for a node is thus given by:

E_GAF / E_node = P_GAF / P_node ≈ 1 / M’     (11.15)

The lifetime of each node in the grid is increased by the same factor
M’. As a result, the average lifetime of a grid, i.e., the time that at least one
node in the grid is still alive, is given by equation (11.16), where t_node is the
lifetime of a node without GAF. One can consider a grid to be a “virtual
node” that is composed of M’ actual nodes.

t_grid = M’ · t_node     (11.16)

Note that P_GAF and t_grid, which are averages over all of the grids, only
depend on M’ and not on the exact distribution of nodes in the used grids. Of
course, the variance of both the node power and the grid lifetime depends on
the distribution. If one had full control over the network deployment, one
could ensure that every used grid has exactly M’ nodes. This would
minimize the power and lifetime variance.
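The analysis above can be turned into a short numerical sketch. The uniform random deployment and the Poisson model for grid occupancy (giving a used-grid fraction of 1 − e^(−M)) are assumptions of this sketch rather than statements from the text:

```python
import math

# Scenario from the text: N = 100 nodes, range R = 20 m, field L x L, L = 79.27 m.
N, R, L = 100, 20.0, 79.27
r = R / math.sqrt(5)                  # largest grid size with r * sqrt(5) <= R

M = N * r * r / (L * L)               # average nodes per virtual grid
mu = N * math.pi * R * R / (L * L)    # average one-hop neighbors (~20 here)

# Assumption: uniform deployment -> grid occupancy ~ Poisson(M),
# so the fraction of non-empty ("used") grids is beta = 1 - exp(-M).
beta = 1.0 - math.exp(-M)
M_used = M / beta                     # M', average nodes per used grid

print(f"M = {M:.2f} nodes/grid, mu = {mu:.1f} neighbors")
print(f"relative node energy E_GAF/E ~ 1/M' = {1.0 / M_used:.2f}")
print(f"grid lifetime gain ~ M' = {M_used:.2f}")
```

With about 20 neighbors per node, each used grid holds M’ ≈ 1.8 nodes on average, so GAF alone cuts the node power to roughly 57% in this scenario; denser deployments push M’ up and the relative energy down.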

The top curve in Figure 11.9 shows how GAF trades off energy with node
density in a specific scenario. The simulation results are close to the results
from the analysis presented above. The scenario is for a network with 100
nodes, each with radio range of 20 m, and a square area of size L x L in
which the nodes are uniformly deployed. The size L is chosen such that the
average number of one-hop neighbors of a node is 20 and leads to L = 79.27
m. The MAC protocol is a simplified version of the 802.11 Wireless LAN
MAC protocol in its Distributed Coordination Function mode. The radio data
rate is 2.4 kbps. The node closest to the top left corner detects an event and
sends 20 information packets of 1040 bits to the data sink with an inter-packet
spacing of 16 seconds. The data sink is the sensor node located
closest to the bottom right corner of the field. The average path length
observed is between 6 and 7 hops. The results are averages over 100
simulation runs.
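As a quick arithmetic check (assuming, as for a uniform deployment, an average neighbor count of N·π·R²/L²), the quoted field size follows directly:

```python
import math

# Choose field size L so that N * pi * R^2 / L^2 = 20 (N = 100 nodes, R = 20 m).
N, R, target_neighbors = 100, 20.0, 20
L = math.sqrt(N * math.pi * R * R / target_neighbors)
print(f"L = {L:.2f} m")  # matches the 79.27 m quoted in the text
```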

11.5.1.2 Topology Management via Energy vs. Set-up Latency Trade-off

An example of topology management by trading energy reduction for an
increase in communication set-up time is STEM [30] for wireless sensor
networks. The basic idea is that since a sensor network is in a monitoring
state a vast majority of time during its life, it is futile to preserve network
connectivity during that time. Ideally, one would like to only turn on the
sensors and some pre-processing circuitry. When a possible event is
detected, the main processor is woken up to analyze the data in more detail.
The radio, which is normally turned off, is only woken up if the processor
decides that the information needs to be forwarded to the data sink. Of
course, different parts of the network could be in monitoring or transfer
state, so, strictly speaking, the “state” is more a property of the locality of
a node than of the entire network.
An approach that is closely related to STEM is the use of a separate
paging channel to wake up nodes that have turned off their main radio.
However, the paging channel radio cannot be put in the sleep mode for
obvious reasons. This approach thus assumes that the paging radio consumes
much lower power than the one used for regular data communications. It is
yet unclear if such a radio can be designed, although there are research
projects underway. As we can see, STEM basically emulates the behavior of
a paging channel by having a radio with a low duty cycle instead of a radio
with low-power consumption.
Now, the problem is that the radio of the next hop in the path to the data
sink is still turned off if it did not detect that same event. As a solution, each
node periodically turns on its radio for a short time in order to listen to see if
someone wants to communicate with it. The node that wants to
communicate, the “initiator node,” sends out beacons with the ID of the
node it is trying to wake up, called the “target node.” In fact, this can be
viewed as the initiator node attempting to activate the link between itself and
the target node. As soon as the target node receives this beacon, it responds
to the initiator node and both keep their radio on at this point. If the packet
needs to be relayed further, the target node will become the initiator node for
the next hop, and the process is repeated.
Once both nodes that make up a link have their radio on, the link is active
and can be used for subsequent packets. In order for actual data
transmissions not to interfere with the wake-up protocol, one solution is to
send them in different frequency bands using a separate radio in each band.
Other options include using a single radio capable of operating in two
distinct frequency bands at different times or to use a single time-
synchronized radio with logical channels defined in time for data and
control. Both of the last two options have performance penalties though.

Figure 11.10 shows STEM’s operation at one particular node in the
network. At time t1 the node wants to wake one of its neighbors up and thus
becomes an initiator. It starts sending beacon packets on frequency f1 until it
receives a response from the target node, which happens at time t2. At this
moment, the radio in frequency band f2 is turned on for regular data
transmissions. Note that at the same time, the radio in band f1 still wakes up
periodically from its sleep state to listen for nodes that want to contact it.
After the data transmissions have ended (e.g., at the end of a predetermined
stream of packets, after a timeout, etc.), the node turns its radio in band f2
off again. At time t3 it receives a beacon from another initiator node while
listening in band f1. The node responds to the initiator and turns its radio on
again in band f2.
In order for the target node to receive at least one beacon, it needs to turn
on its radio for a sufficiently long time, denoted T_on. Figure 11.11
illustrates the worst-case situation, where the radio is turned on just when it is
too late to receive the first beacon. In order to receive the second beacon,
T_on should be at least as long as the transmit time of a beacon packet plus
the inter-beacon interval.
Even in the case of two radios, collisions in the wake-up plane are
possible. To handle this problem, extra provisions are added to the basic
STEM operation. A node also turns on its data radio when there is a collision
in the wakeup plane. It does not truly receive packets, but it can detect the
presence of signal energy, which is similar to the principle of carrier sensing.
In this case, it does not send back an acknowledgement, as it would likely
collide with that of other nodes that are also woken up this way.
After waiting for a response from the target node for time T, the initiator
starts transmitting on the data plane. Indeed, the target node will either have
received the beacon correctly or seen a collided packet, as it surely has
woken up once during this period. In any case, it has turned the radio in the
data plane on. If there is no collision, the target node sends back an
acknowledgement, so that the initiator knows immediately when the target
node is up. This shortens the set-up latency. If nodes do not receive data for
some time, they time out and go back to sleep. This happens to nodes that
were accidentally woken up. Eventually only the desired target node keeps
its data-plane radio on for the duration of the data transfer. The regular MAC
layer handles any collision that takes place on the data plane.
The benefit of STEM can be quantified by the ratio of the energy
consumption at a node with STEM to the energy consumption without STEM.
Analysis [30] shows that this relative energy is approximately given by:

E_STEM / E ≈ t_d · f_s + (1 − t_d · f_s) · [δ_w + (1 − δ_w) · P_s / P]

where δ_w is the duty cycle in the wakeup plane, t_d is the average
duration of a data burst which requires a path set-up, f_s is the frequency of
path set-ups, and P_s and P are the power consumption of the radio in the
sleep and active states (the radio power is assumed to be the same in the idle,
transmit, and receive states). If P_s is negligible and the network is mostly in
the monitoring state, then the ratio reduces to δ_w. Also, the average
set-up latency per hop is T/2 = T_on/(2 δ_w), where T is the period of the
wakeup cycle and T_on the time the wakeup radio listens in each period.
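These trade-offs can be sketched numerically. Both the relative-energy model below, E_STEM/E ≈ t_d·f_s + (1 − t_d·f_s)(δ_w + (1 − δ_w)·P_s/P), and the per-hop set-up latency of half the wakeup period are approximations, and the receiver on-time of 0.26 s is an assumed value chosen for illustration:

```python
def stem_relative_energy(dw, transfer_frac, ps_over_p=0.0):
    """Approximate E_STEM/E: the data radio is on a fraction `transfer_frac`
    of the time; otherwise the wakeup radio runs at duty cycle `dw` and the
    remainder is sleep at relative power `ps_over_p`."""
    monitoring = 1.0 - transfer_frac
    return transfer_frac + monitoring * (dw + (1.0 - dw) * ps_over_p)

def stem_setup_latency(dw, t_on):
    """Average per-hop set-up latency: half the wakeup period T = t_on/dw."""
    return t_on / (2.0 * dw)

T_ON = 0.26  # assumed receiver on-time per wakeup period, seconds
for dw in (0.5, 0.1, 0.01):
    e = stem_relative_energy(dw, transfer_frac=0.01)
    lat = stem_setup_latency(dw, T_ON)
    print(f"duty cycle {dw:4.2f}: E_STEM/E ~ {e:.3f}, "
          f"set-up latency/hop ~ {lat:6.2f} s")
```

At a duty cycle of 0.1 with the network in the transfer state 1% of the time, this model gives roughly a ten-fold energy reduction at about 1.3 s of set-up latency per hop, consistent with the numbers quoted for Figure 11.13.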
For the same network scenario as in the previous subsection for GAF,
and with a single data transfer (so that the path set-up frequency is the
inverse of the simulation time), the two plots in Figure 11.12 show the
normalized average set-up latency per hop as a function of the inverse duty
cycle, and the normalized power as a function of the inverse duty cycle for
different values of the monitoring-to-transfer time ratio. As a basis for
comparison, the latter plot includes a curve for the case without topology
management. For a fair comparison, there is only one radio in the base
scheme, which is never turned off. The other curves represent the
performance for STEM; as the monitoring-to-transfer ratio increases, the
monitoring state becomes more predominant. STEM results in energy
savings as soon as this ratio exceeds about one, which means that the
network is in the transfer state about half the time. When in the monitoring
state about 99% of the time, the network can already exploit the full
benefits of STEM.
Figure 11.13 explicitly shows the trade-off between energy savings and
set-up latency for different fractions of time spent in the monitoring state.
The energy gains of STEM are substantial, and can be traded off effectively
against setup latency. For example, in the regime where the network is in the
monitoring state 99% of the time, a ten-fold decrease of energy consumption
requires only a setup latency of about 1.3 seconds per hop. Note that this is for a relatively
slow radio with a bit-rate of just 2.4 Kbps. By choosing a radio that is 10
times faster, this latency would be a mere 130 ms.

11.5.1.3 Hybrid Approach

Topology management schemes, such as GAF and SPAN, coordinate the
radio sleep and wakeup cycles while ensuring adequate communication
capacity. The resulting energy savings increase with the network density.
STEM, on the other hand, leverages the set-up latency. The two types of
schemes can be integrated to achieve additional gains by exploiting both the
density and the set-up latency dimensions against energy. Figure 11.14
shows how a hybrid scheme based on combining STEM and GAF [30]
performs against STEM or GAF alone. The STEM+GAF combination
outperforms either scheme by itself, except for extremely high set-up
latencies or extremely high densities, which are far beyond any practical
values. The combination of STEM and GAF thus performs well at any
reasonable operating point in the latency-density dimensions, exploiting both
of them as much as possible. Even at low densities or low latencies, the other
dimension can be traded for energy savings. The gains are compounded
when both dimensions can be exploited together. Compared to a network
without topology management, the STEM+GAF combination easily reduces
the energy consumption to 10% or less.
Increased energy savings can be obtained at the cost of either deploying
more nodes or allowing more set-up latency per hop. These choices are
essentially part of a multi-dimensional design trade-off, which is impacted
by the specific application, the layout of the network, the cost of the nodes,
the desired network lifetime, and many other factors.

11.5.2 Energy-aware Data Routing

Another important class of higher-layer approaches for energy efficiency
focuses on the routing protocols that are used to disseminate data.
Traditionally, these protocols have been designed for wired networks, such
as the Internet, and have not focused on energy. Energy-efficient routing has
gained much attention for wireless ad hoc networks, and several techniques
[31][12][32] have been proposed to select the routing path with a certain
energy-related goal in mind. For example, [32] describes routing protocols
for wireless ad hoc networks that use metrics such as energy consumed per
packet, time to network partition, and variance in node energy levels, in
contrast to conventional routing path selection metrics such as minimum
hop, shortest delay, maximum link quality, etc. Indeed, the routing paths
selected when using energy-based metrics are often different from those
obtained when using conventional metrics. For example, a path that seeks to
minimize the energy spent may be the one that avoids congested areas where
the interference level is higher and, therefore, may not be the shortest one.
It is important to note that the goal in energy-aware routing is not simply
to select the path that would yield minimum energy consumption in routing a
packet. Often, the goal is to maximize the network lifetime. It is important to
avoid paths that would result in power hot spots developing and then
isolating large parts of the network. Routes through regions of the network
that are running low on energy resources should be avoided, thus preserving
them for future, possibly critical, tasks. For the same reason, it is, in general,
undesirable to continuously forward traffic via the same path, even though it
minimizes the energy, up to the point where the nodes on that path are
depleted of energy, and the network connectivity is compromised. It would,
instead, be preferable to spread the load more uniformly over the network.
This general guideline [33] can increase the network lifetime in typical
scenarios, although this is not always the case as the optimal distribution of
traffic load during routing is possible only when future network activity is
known.
Closely related to the issue of energy-efficient routing is the issue of
transmission power control. Many radios provide the ability for the higher
layer protocols to control the transmit power. The larger the transmit power,
the longer the range to which the radio can transmit. The result of a longer
range is two-fold: a richer network connectivity as a node can reach more
nodes in a single hop and a higher level of interference as the effect of a
node transmission is felt in a larger area around it.
The close relationship between routing and transmit power control can be
used by a routing protocol to save energy. The protocol can dynamically
select the optimum transmission power levels to minimize the energy spent
to route data between nodes and to even the energy consumption among the
nodes.
It is more energy-efficient to follow a multi-hop path instead of a direct
transmission if certain conditions are met. The conditions are related to the
radio signal attenuation characteristics, the distances covered in the two
cases (i.e., multi-hop and direct), and the radio characteristics. To analyze
this further, recall the transmit and receive energy per bit for communication
over a distance d in the radio model introduced earlier:

E_t(d) = α_t + α_amp · d^n ,    E_r = α_r     (11.19)

where α_t and α_r denote the per-bit energy consumed by the transmitter and
receiver electronics, α_amp · d^n the radiated energy needed to cover
distance d, and n the path-loss exponent.

Now consider two different cases for sending data from node A to node
B that is distance r away: direct routing and multi-hop. The first case is
direct routing, where the transmit power of node A is set so that its range is
equal to r. The total energy spent in transmitting a bit from node A to B is
then given by:

E_direct = E_t(r) + E_r = α_t + α_amp · r^n + α_r     (11.20)

The second case is multi-hop routing, where the scenario considered is
one in which the data is routed from A to B using N intermediate relay
nodes. For simplicity, consider the case when the relay nodes are equidistant
and lie along the straight line from A to B. The relay nodes are then a
distance r/(N+1) apart, and the transmit powers of node A and the relay
nodes are set so that their range is r/(N+1), just enough to reach the next-hop
destination. The total energy spent in transmitting a bit from node A to B is
given by equation (11.21). If the relay nodes are not equidistant or are not
along the straight line, the energy for the multi-hop case will be higher, so
this analysis represents the best case for multi-hop.

E_multihop = (N+1) · [α_t + α_amp · (r/(N+1))^n + α_r]     (11.21)

So, when is multi-hop better? If N is given, one can show that multi-hop
routing leads to lower energy if the following condition is satisfied:

α_amp · r^n · [1 − (N+1)^(1−n)] > N · (α_t + α_r)     (11.22)

If, on the other hand, one is allowed to choose N, then the optimum N for
multi-hop is given by:

N_opt = r · [(n−1) · α_amp / (α_t + α_r)]^(1/n) − 1     (11.23)
The condition under which multi-hop routing with optimally chosen N
leads to lower energy is obtained by plugging N_opt into equation (11.22).
Researchers have exploited these relationships in designing routing
protocols. Some of these efforts use these relationships to decide at the
design-time whether to use multi-hop routing or direct routing with clusters
[12]. Others evaluate these relationships at run-time for the selection of
energy efficient paths and corresponding transmission power settings [31].
Much of the existing work assumes that the electronics energy α_t + α_r is
zero, leading to results that are not applicable to devices with short-range radios.
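A short sketch makes the comparison concrete. The parameter values below (50 nJ/bit of electronics energy on each side, an amplifier constant of 1.3e-6 nJ/bit/m^4, path-loss exponent n = 4) are illustrative assumptions in the spirit of first-order radio models, not values from the text:

```python
def direct_energy(r, a_t, a_r, a_amp, n):
    """Energy per bit for one direct transmission over distance r."""
    return a_t + a_amp * r ** n + a_r

def multihop_energy(r, N, a_t, a_r, a_amp, n):
    """Energy per bit via N equidistant relays on the line A-B (best case)."""
    hops = N + 1
    d = r / hops
    return hops * (a_t + a_amp * d ** n + a_r)

def optimal_relays(r, a_t, a_r, a_amp, n):
    """N_opt = r * ((n-1) * a_amp / (a_t + a_r))**(1/n) - 1, rounded, >= 0."""
    k = r * ((n - 1) * a_amp / (a_t + a_r)) ** (1.0 / n)
    return max(0, round(k) - 1)

# Assumed parameters: energies in nJ/bit, distances in meters.
a_t, a_r, a_amp, n = 50.0, 50.0, 1.3e-6, 4
r = 200.0
N = optimal_relays(r, a_t, a_r, a_amp, n)
print(f"optimal number of relays: {N}")
print(f"direct   : {direct_energy(r, a_t, a_r, a_amp, n):7.1f} nJ/bit")
print(f"multi-hop: {multihop_energy(r, N, a_t, a_r, a_amp, n):7.1f} nJ/bit")
```

Over 200 m the amplifier term dominates and a couple of relays cut the per-bit energy several-fold, whereas at short ranges the fixed electronics cost α_t + α_r makes direct transmission the winner, which is exactly why assuming that cost to be zero is misleading for short-range radios.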

11.6 SUMMARY

The extreme energy constraints of networked embedded systems together
with the dominance of communication energy over computation energy
make it imperative that power-management techniques begin to address
communication. Moreover, the networked nature necessitates a network-
wide perspective to power management, instead of power management just
at a single node. This chapter gave a glimpse of some of the techniques that
are possible at various layers of the network in the case of wireless
communications. At the level of a single wireless link, modulation and code
scaling based power management is effective for long-range
communications while shutdown-based power management is effective for
short-range communications. At the level of the network, energy can be
traded against density, latency, and accuracy via routing and topology
management. While the chapter focused on wireless communications where
the energy problems are particularly severe, ideas such as modulation and
code scaling are applicable to wired communications as well.

ACKNOWLEDGEMENTS

The author would like to acknowledge the contributions of his current and
past students at UCLA’s Networked and Embedded Systems Laboratory
(http://nesl.ee.ucla.edu) on whose research this chapter is based. In particular,
the research by Andreas Savvides, Athanassios Boulis, Curt Schurgers, Paul
Lettieri, Saurabh Ganeriwal, Sung Park, Vijay Raghunathan, and Vlasios
Tsiatsis has played a significant role in formulating the ideas expressed in
this chapter.

REFERENCES

[1] Chandrakasan, A., Sheng, S., Brodersen, R., “Low-power CMOS digital Design,” IEEE
Journal of Solid-State Circuits, Vol.27, pp. 473-484, Dec 1992.
[2] Benini, L., Bogliolo, A., De Micheli, G., “A survey of design techniques for system-level
dynamic power management,” IEEE Transactions on CAD, pp. 813-833, June 1999.
[3] Raghunathan, V., Schurgers, C., Park, S., Srivastava, M., “Energy aware microsensor
networks,” IEEE Signal Processing, vol. 19, no. 2, pp. 40-50, March 2002.
[4] Nielsen, L., Niessen, C., Sparsø, J., van Berkel, K., “Low power operation using self-
timed circuits and adaptive scaling of the supply voltage,” IEEE Trans. on VLSI Systems,
Vol.2, No.4, pp. 391-397, Dec 1994.
[5] Pottie, G.J., Kaiser, W.J., “Wireless integrated network sensors,” Communications of the
ACM, vol.43, (no.5), pp.51-58, May 2000.
[6] Srivastava, M., Chandrakasan, A., Brodersen, R., “Predictive system shutdown and other
architectural techniques for energy efficient programmable computation,” IEEE Trans.
on VLSI Systems, vol. 4, no. 1, pp. 42-55, March 1996.
[7] Gruian, F., “Hard real-time scheduling for low energy using stochastic data and DVS
processor,” ACM ISLPED '01, pp. 46-51, Huntington Beach, CA, August 2001.
[8] Raghunathan, V., Spanos, P., Srivastava, M., “Adaptive power-fidelity in energy aware
wireless systems,” RTSS'01, pp. 106-115, London, UK, December 2001.
[9] Weiser, M., Welch, B., Demers, A., Shenker, B., “Scheduling for reduced CPU energy,”
USENIX Symposium on Operating Systems Design and Implementation, pp. 13-23, Nov
1994.
[10] Yao, F., Demers, A., Shenker, S., “A scheduling model for reduced CPU energy,” 36th
Annual Symposium on Foundations of Computer Science, Milwaukee, WI, pp. 374-385,
Oct 1995.
[11] Proakis, J., “Digital Communications,” McGraw-Hill Series in Electrical and Computer
Engineering, 3rd Edition, 1995.
[12] Heinzelman, W., Chandrakasan, A., Balakrishnan, H., “Energy-efficient communication
protocol for wireless microsensor networks,” HICSS 2000, pp. 3005-3014, Maui, HI,
Jan. 2000.
[13] Xu, Y., Heidemann, J., Estrin, D., “Geography-informed energy conservation for ad hoc
routing,” Proceedings of the Seventh Annual International Conference on Mobile
Computing and Networking, pp. 70-84, Rome, Italy, July 2001.
[14] Wang, A., Cho, S-H., Sodini, C.G., Chandrakasan, A.P., “Energy-efficient modulation
and MAC for asymmetric microsensor systems,” ACM ISLPED, pp. 106-111,
Huntington Beach, CA, August 2001.
[15] Savvides, A., Park, S., M. Srivastava, “On modeling networks of wireless micro-
sensors,” ACM SIGMETRICS 2001, pp. 318-319, Cambridge, MA, June 2001.
[16] Srivastava, M.B. “Design and Optimization of Networked Wireless Information
Systems,” IEEE Computer Society Workshop on VLSI, pp. 71-76, April 1998.
[17] Schurgers, C., Aberthorne, O., Srivastava, M., “Modulation scaling for energy aware
communication systems,” ACM ISLPED'01, pp.96-99, Huntington Beach, CA, August
2001.
[18] Schurgers, C., Raghunathan, V., Srivastava, M., “Modulation scaling for real-time
energy aware packet scheduling,” Globecom'01, pp. 3653-3657, San Antonio, TX,
November 2001.
[19] Prabhakar, B., Biyikoglu, E., Gamal, A., “Energy-efficient transmission over a wireless
link via lazy packet scheduling,” Infocom’01, pp. 386-394, April 2001.
[20] Frenger, P., Orten, P., Ottosson, T., Svensson, A., “Multi-rate convolutional codes,”
Tech. Report No. 21, Chalmers University of Technology, Sweden, April 1998.
[21] Jeffay, K., Stanat, D., Martel, C., “On non-preemptive scheduling of periodic and
sporadic tasks,” RTSS’91, San Antonio, TX, pp. 129-139, Dec. 1991.
[22] Schurgers, C., Srivastava, M., “Energy efficient wireless scheduling: adaptive loading in
time,” WCNC’02, Orlando, FL, March 2002.
[23] Lettieri, P., Fragouli, C., Srivastava, M.B., “Low power error control for wireless links,”
ACM MobiCom '97, Budapest, Hungary, pp. 139-150, Sept. 1997.
[24] Lettieri, P., Srivastava, M.B., “Adaptive frame length control for improving wireless link
throughput, range, and energy efficiency,” IEEE INFOCOM'98 Conference on Computer
Communications, vol. 2, pp. 564-571, March 1998.
[25] Sivalingam, K.M., Chen, J.-C., Agrawal, P., Srivastava, M.B., “Design and analysis of
low-power access protocols for wireless and mobile ATM networks,” ACM/Baltzer
Wireless Networks, vol. 6, no. 1, pp. 73-87, February 2000.
[26] Sohrabi, K., Gao, J., Ailawadhi, V., Pottie, G.J., “Protocols for self-organization of a
wireless sensor network,” IEEE Personal Communications, vol.7, (no.5), pp. 16-27, Oct.
2000.
[27] Woo, A., Culler, D., “A transmission control scheme for media access in sensor
networks,” Proceedings of the Seventh Annual International Conference on Mobile
Computing and Networking, pp. 221-235, Rome, Italy, July 2001.
[28] Ye, W., Heidemann, J., Estrin, D., “An energy-efficient MAC protocol for wireless
sensor networks,” IEEE INFOCOM'02 Conference on Computer Communications, June
2002.
[29] Chen, B., Jamieson, K., Balakrishnan, H., Morris, R. “Span: an energy-efficient
coordination algorithm for topology maintenance in ad hoc wireless networks,”
MobiCom 2001, Rome, Italy, pp. 70-84, July 2001.
[30] Schurgers, C., Tsiatsis, V., Ganeriwal, S., and Srivastava, M., “Topology management
for sensor networks: exploiting latency and density,” The Third ACM International
Symposium on Mobile Ad Hoc Networking and Computing (ACM Mobihoc 2002),
Lausanne, Switzerland, June 2002.
[31] Chang, J.-H., Tassiulas, L., “Energy conserving routing in wireless ad-hoc networks,”
IEEE INFOCOM’00 Conference on Computer Communications, Tel Aviv, Israel, pp.
22-31, March 2000.
[32] Singh, S., Woo, M., Raghavendra, C.S., “Power-aware routing in mobile ad hoc
networks,” Proceedings of the Fourth Annual ACM/IEEE International Conference on
Mobile Computing and Networking, pp. 181-190, Dallas, Texas, October 1998.
[33] Schurgers, C., Srivastava, M., “Energy efficient routing in sensor networks,” Proc.
Milcom, pp. 357-361, Vienna, VA, October 2001.
[34] Guo, C., Zhong, L., Rabaey, J., “Low-power distributed MAC for ad hoc sensor radio
networks,” IEEE Globecom’01, pp. 2944-2948, San Antonio, TX, Nov 2001.
Chapter 12
Power-Aware Wireless Microsensor Networks

Rex Min, Seong-Hwan Cho, Manish Bhardwaj, Eugene Shih, Alice Wang,
Anantha Chandrakasan
Massachusetts Institute of Technology

Abstract: Distributed networks of thousands of collaborating microsensors promise a
maintenance-free, fault-tolerant platform for gathering rich, multi-dimensional
observations of the environment. As a microsensor node must operate for
years on a tiny battery, careful and innovative techniques are necessary to
eliminate energy inefficiencies overlooked in the past. For instance, properties
of VLSI (Very Large Scale Integration) hardware, such as leakage and the
start-up time of radio electronics, must be considered for their impact on
system energy, especially during long idle periods. Nodes must gracefully
scale energy consumption in response to ever-varying performance demands.
All levels of the communication hierarchy, from the link layer to media access
to routing protocols, must be tuned for the hardware and application. Careful
attention to the details of energy consumption at every point in the design
process will be the key enabler for dense, robust microsensor networks that
deliver maximal system lifetime in the most challenging and operationally
diverse environments.

Key words: Sensor networks, energy dissipation, power awareness, energy scalability,
communication vs. computation tradeoff, StrongARM SA-1100, leakage
current, processor energy model, radio energy model, dynamic voltage scaling,
adjustable radio modulation, adaptive forward error correction, media access
control, multihop routing, data aggregation, energy-quality scalability, low
power transceiver, FIR filtering, project.

12.1 INTRODUCTION

In recent years, the idea of wireless microsensor networks has garnered a
great deal of attention from researchers, including those in the field of mobile
computing and communications [1][2]. A distributed, ad-hoc wireless
microsensor network [3] consists of hundreds to several thousands of small
sensor nodes scattered throughout an area of interest. Each individual sensor
contains both processing and communication elements and is designed to
monitor the environment for events specified by the network deployer.
Information about the environment is gathered by the sensors and is
delivered to a central base station where the user can extract the desired data.
Because of the large number of nodes in such a network, sensors can
collaborate to perform high quality sensing and form fault-tolerant sensing
systems. With these advantages in mind, many applications have been
proposed for distributed, wireless microsensor networks such as warehouse
inventory tracking, location-sensing, machine-mounted sensing, patient
monitoring, and building climate control [2][4][5][6].
Many of the necessary components and enabling technologies for
wireless microsensor networks are already in place. Microscopic MEMS
motion sensors are routinely fabricated on silicon. Digital circuits shrink in
area with each new process technology. Entire radio transceivers—including
the associated digital electronics—are fabricated on a single chip [7].
Refinements of these enabling technologies will soon yield the form factors
practical for a microsensor node.

Despite these advances in integration, many system challenges remain.


Because the proposed applications are unique, wireless microsensor systems
will have different challenges and design constraints than existing wireless
networks, such as cellular networks and wireless LANs. Table 12.1 contrasts
the operational characteristics of microsensors and wireless LAN devices. In
brief, a microsensor node is the antithesis of high-bandwidth or long-range
communication: node densities are higher, transmissions shorter, and data
rates lower than any previous wireless system. Thus, large-scale data man-
agement techniques are necessary. Secondly, user constraints and environ-
mental conditions, such as ambient noise and event arrival rate, can be time-
varying in a wireless microsensor network. Thus, the system should be able
to adapt to these varying conditions.

As a concrete application example, a network specification for a machine
monitoring application [31] specifies up to 12 nodes per square meter and a
maximum radio link of 10 meters. Nodes are expected to process about 20
two-byte radio transmissions per second. The required battery life is five to
ten years from an “AA”-sized cell.
This final requirement suggests that the greatest challenge in microsensor
design arises from the energy consumption of the underlying hardware.
Because applications involving wireless sensor networks require long system
lifetimes and fault-tolerance, energy usage must be carefully monitored. Fur-
thermore, since the networks can be deployed in inaccessible or hostile envi-
ronments, replacing the batteries that power the individual nodes is
undesirable, if not impossible. In contrast to the rapid advances in
integration, battery energy densities have improved only slowly. While the
density of transistors on a chip, for example, has consistently doubled every
18 months, the energy density of batteries has doubled every five to twenty
years, depending on the particular chemistry. Prolonged refinement of any
chemistry yields diminishing returns [9]. Moore’s Law simply does not
apply to batteries, making energy conservation strategies essential for
extending a node’s lifetime.
This need to minimize energy consumption and to maximize the lifetime
of a system makes the design of wireless sensor networks difficult. For
example, since packets can be small and data rates low, low-duty cycle radio
electronics will be used in the system. However, designing such circuits so
that they are energy-efficient is technically challenging. Current commercial
radio transceivers, such as those proposed for the Bluetooth standard [10],
are not well suited for microsensor applications since the energy overhead of
turning them on and off is high. Thus, innovative solutions in transceiver and
protocol design are required to achieve efficient transmission of short
packets over short distances.
Another challenge arises from the remote placement of these nodes and
the high cost of communication. Since sensors are remotely deployed,
transmitting to a central base station has a high energy cost. Thus, the use of
data aggregation schemes to reduce the amount of redundant data in the
network is beneficial [11]. Finally, since environmental conditions and user
constraints can be time-varying, the use of static algorithms and protocols
can result in unnecessary energy consumption. Wireless microsensors must
allow underlying hardware to adapt with higher-level algorithms. By giving
upper layers the opportunity to adapt the hardware in response to changes in
system state, the environment, and the user's quality constraints, the energy
consumption of the node can be better controlled.
In summary, reducing energy consumption to extend system lifetime is a
primary concern in microsensor networks. Thus, protocols and algorithms
should be designed with power dissipation in mind. Power-aware design
begins with a firm understanding of the energy-consumption characteristics
of the hardware and the design of all system layers for graceful energy
scalability with changing operational demands. The design of a power-aware
communication system requires a choice of a link layer and media access
schemes tailored to the power-dissipation properties of the radio and careful
collaboration among intermediate nodes during data relay. The following
sections discuss each of these topics in order, as well as a prototype
microsensor node developed with power awareness at all levels of the
system hierarchy.

12.2 NODE ENERGY CONSUMPTION CHARACTERISTICS

12.2.1 Hardware Architecture


Figure 12.1 presents a generalized architecture for a microsensor node.
Observations about the environment are gathered using some sensing sub-
system consisting of sensors connected to an analog-to-digital (A/D) con-
verter. Once enough data is collected, the processing subsystem of the node
can digitally process the data in preparation for relay to a nearby node (or
distant base station). Portions of the processing subsystem that are
microprocessor-based would also include RAM and flash ROM for data and
program storage and an operating system with light memory and
computational overhead. Code for the relevant data processing algorithms
and communication protocols is stored in ROM. In order to deliver data or control
messages to neighboring nodes, data is passed to the node’s radio subsystem.
Finally, power for the node is provided by the battery subsystem with DC-
DC conversion to provide the voltages required by the aforementioned
components.
It is instructive to consider the power consumption characteristics of a
microsensor node in three parts: the sensing circuitry, the digital processing,
and the radio transceiver. The sensing circuitry, which consists of the
environmental sensors and the A/D converter, requires energy for bias
currents, as well as amplification and analog filtering. Its power dissipation
is relatively constant while on, and improvements to its energy-efficiency
depend on increasing integration and skilled analog circuit design. This
section considers the energies of the remaining two sections—digital
computation and radio transmission—and their relationship to the
operational characteristics of a microsensor node.

12.2.2 Digital Processing Energy

A node’s digital processing circuits are typically used for digital signal
processing of gathered data and for implementation of the protocol stack.
Energy consumed by digital circuits consists of dynamic and static
dissipation as follows:

  E_total = C·V_dd^2 + V_dd·I_0·e^(V_dd/(n·V_T))·Δt                  (12.1)

The dynamic energy term is C·V_dd^2, with C representing the switched
capacitance and V_dd the supply voltage. The dynamic energy of digital
computation is the energy required to switch parasitic capacitors on an
integrated circuit. The static dissipation term, originating from the
undesirable current leakage from power to ground at all times, is set by the
thermal voltage V_T and constants I_0 and n that can be measured for a given
process technology. Note that, for a constant supply voltage V_dd, the
switching energy for any given computation is independent of time, while
leakage energy is linear with time.
modern CMOS applications [12], the trend is beginning to reverse with
recent semiconductor process technologies. Each new process generation
increases leakage threefold; leakage in advanced process technologies will
soon approach 50% of a digital circuit’s operating power [13].
Even in a process technology that is typically dominated by switching
energy, a microsensor’s long idle periods impose a low-duty cycle on its pro-
cessor, encouraging the dominance of leakage energy.

Figure 12.2 illustrates the possibility of leakage dominance with
measured data from the StrongARM SA-1100 microprocessor [14]. The SA-
1100 is a commercial low-power microprocessor whose energy consumption
has been extensively characterized [15][30] and, therefore, serves as a
recurring example throughout this chapter. Graph (a) shows that leakage
energy begins to dominate over switching energy as the processor’s duty
cycle is reduced. Leakage energy is proportional to time, whether or not the
processor is doing useful work. As a result, slowing the clock down actually
increases the amount of leakage energy within each clock period, causing
the energy per operation to increase. This is illustrated by graph (b). A
reduction in clock frequency reduces the number of idle cycles, but leakage
nonetheless remains proportional to time.
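As a rough numerical illustration of this effect (the parameter values below are hypothetical placeholders, not SA-1100 measurements), the two-term energy model can be evaluated at two clock rates:

```python
import math

def op_energy(c_sw, vdd, i0, n, vt, t_clk):
    """Energy of one operation: a switching term plus leakage integrated
    over one clock period (the leakage term grows linearly with time)."""
    e_switch = c_sw * vdd ** 2                 # C * Vdd^2
    i_leak = i0 * math.exp(vdd / (n * vt))     # exponential leakage-current model
    return e_switch + vdd * i_leak * t_clk     # total energy per operation

# Hypothetical parameters, for illustration only.
C_SW, VDD, I0, N, VT = 1e-9, 1.5, 1e-6, 21.0, 0.026

fast = op_energy(C_SW, VDD, I0, N, VT, t_clk=1 / 200e6)  # 200 MHz clock
slow = op_energy(C_SW, VDD, I0, N, VT, t_clk=1 / 50e6)   # 50 MHz clock

# At constant Vdd, a slower clock stretches each period, so the leakage
# contribution (and hence energy per operation) grows, as in Figure 12.2(b).
assert slow > fast
```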
The easiest way to reduce leakage is to shut down the power supply to
the idle components. For relatively simple circuits such as analog sensors,
circuits can be powered up and down quickly, with no ill effect. Shutting
down more complicated circuits, however, requires a time and energy
overhead. For example, powering down a processor requires the preservation
of its state. If the processor should be needed immediately after it has been
powered down, the energy and time required to save and restore the state is
wasted. In choosing any shutdown policy, the hidden time and energy cost to
shut a circuit down must be balanced with the expected duration of
shutdown.
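This balancing act reduces to a simple break-even computation, sketched below with hypothetical power and overhead numbers (not measurements of any particular processor):

```python
def breakeven_idle_time(p_idle, e_overhead, p_sleep=0.0):
    """Shortest idle duration for which shutting down saves energy.
    p_idle:     power burned by staying on (e.g., leakage), in watts
    e_overhead: energy to save state, power down, and restore, in joules
    p_sleep:    residual power while shut down, in watts"""
    return e_overhead / (p_idle - p_sleep)

# Hypothetical numbers: 50 mW of idle power, 5 mJ shutdown-plus-wakeup cost.
t_be = breakeven_idle_time(p_idle=0.05, e_overhead=0.005)
print(t_be)  # ~0.1 s: sleep only if the idle gap is expected to exceed this
```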

12.2.3 Radio Transceiver Energy

The issues of static power and the shutdown cost, two key concerns for the
node’s digital circuits, emerge analogously in the node’s radio. The energy
consumption of the radio consists of static power dissipated by the analog
electronics (analogous to leakage in the digital case, except that these bias
currents serve to stabilize the radio) and the radiated RF energy. The radiated
energy, which scales with transmitted distance as d^2 to d^4 depending on envi-
ronmental conditions, has historically dominated radio energy. For closely
packed microsensors, however, the radio electronics are of greater concern.
The average power consumption of a microsensor radio can be
described by:

  P_radio = N_tx·[P_tx·(T_on,tx + T_st) + P_out·T_on,tx]
            + N_rx·[P_rx·(T_on,rx + T_st)]                           (12.2)

where N_tx/rx is the average number of times per second that the
transmitter/receiver is used, P_tx/rx is the power consumption of the
transmitter/receiver, P_out is the output transmit power, T_on,tx/rx is the
transmit/receive on-time (actual data transmission/reception time), and T_st is
the start-up time of the transceiver. N_tx/rx will largely depend on the
application scenario and the media-access control (MAC) protocol being
used. Furthermore, T_on = L/R, where L is the packet size in bits, and R is
the radio’s data rate in bits per second. The power amplifier is assumed to be
on only when communication occurs.
Two key points are particularly noteworthy about microsensor radios.
First, the transceiver power is not likely to vary with the data rate R. In
gigahertz frequency bands, such as the popular 2.4 GHz ISM band, the
power consumption of the transceiver is dominated by the frequency
synthesizer, which generates the carrier frequency. Hence, to first order, R
does not affect the power consumption of the transceiver [16].
Second, the start-up time T_st is a significant concern. During the start-up
time, no data can be sent or received by the transceiver. This is because the
internal phase-locked loop (PLL) of the transceiver must be locked to the
desired carrier frequency before data can be demodulated successfully.
Figure 12.3 plots the measured start-up transient of a commercial 2.4 GHz
low power transceiver [17]. The control input to the voltage-controlled
oscillator (in volts) is plotted versus time.

The start-up time can significantly impact the average energy
consumption per bit, since low-rate wireless sensor networks tend to
communicate with very short packets. In order to save power, a natural idea
is to turn the radio off during idle periods. Unfortunately, when the radio is
needed again, a large amount of power is dissipated to turn it on again;
transceivers today require an initial start-up time on the order of hundreds of
microseconds during which a large amount of power is wasted. Figure 12.4
illustrates the effect of the start-up transient for fixed values of the start-up
time T_st and the output power P_out (in dBm). Energy consumption per bit
is plotted versus the packet size.
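The interplay between packet length and start-up cost can be sketched with the average-power model of equation (12.2); all component powers and both start-up times below are hypothetical values, not measurements of the transceiver in Figure 12.3:

```python
def radio_avg_power(n_tx, n_rx, p_tx, p_rx, p_out, t_st, packet_bits, rate):
    """Average radio power: on-time is L/R, the electronics also run for
    t_st at each start-up, and the PA (p_out) is on only while transmitting."""
    t_on = packet_bits / rate
    e_tx = p_tx * (t_on + t_st) + p_out * t_on   # energy per transmission
    e_rx = p_rx * (t_on + t_st)                  # energy per reception
    return n_tx * e_tx + n_rx * e_rx

# Hypothetical radio: 10/15 mW electronics, 1 mW radiated, 1 Mbps, 100-bit packets.
short_start = radio_avg_power(20, 20, 0.01, 0.015, 0.001, 20e-6, 100, 1e6)
long_start = radio_avg_power(20, 20, 0.01, 0.015, 0.001, 400e-6, 100, 1e6)

# With 100-bit packets, a hundreds-of-microseconds start-up transient
# dominates the 100 us on-time and multiplies the average power.
assert long_start > 2 * short_start
```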

12.3 POWER AWARENESS THROUGH ENERGY SCALABILITY

Great inefficiencies can occur when a microsensor node’s time-varying
demands are mapped onto hardware whose energy consumption is invariant
to operational diversity. This section therefore motivates energy scalability,
the design of circuits, architectures, and algorithms that can gracefully trade-
off performance and quality for energy savings. The key realization is that
the node’s high operational diversity will necessarily lead to energy
inefficiency for a node that is optimized for any one operating point. Perhaps
the nodes are accidentally deployed more closely together than expected,
stimuli are fewer and farther between, or the user chooses to tolerate a few
additional milliseconds of latency. A node that cannot respond to relaxed
performance demands with energy reductions is utilizing energy in an
inefficient manner.
Power awareness, then, is really an awareness of the exact performance
demands of the user and the environment. A power-aware system consumes
just enough energy to achieve only that level of performance. Power-aware
systems exhibit this characteristic at all levels of the system hierarchy.
Energy trade-offs are enabled at the circuit-level and exploited at the
algorithm-level. This section illustrates how energy can be traded for a
variety of performance parameters at many levels of the system hierarchy.

12.3.1 Dynamic Voltage Scaling

Dynamic voltage scaling (DVS) exploits variabilities in processor workload
and latency constraints and allows energy to be traded for quality at the
circuit level [18][19]. As discussed in Section 12.2.2, the switching energy
of any particular computation, C·V_dd^2, is a quantity that is
independent of time. Reducing the supply voltage V_dd offers a quadratic
savings in switching energy at the expense of additional propagation delay
through static logic. Hence, if the workload on the processor is light or the
latency tolerable by the computation is high, the supply voltage V_dd and the
processor frequency can be reduced together to trade latency for energy savings.
As a practical example on a microprocessor, Figure 12.5 depicts the
measured energy consumption of the SA-1100 processor running at full
utilization. The energy consumed per operation is plotted with respect to the
processor frequency and voltage. As expected, a reduction in clock
frequency allows the processor to run at a lower voltage. The quadratic
dependence of switching energy on supply voltage is evident, and for a fixed
voltage, the leakage energy per operation increases as the operations occur
over a longer clock period.

Using a digitally adjustable DC-DC converter, the SA-1100 can adjust its
own core voltage to demonstrate energy-quality tradeoffs with DVS. In
Figure 12.6a, the latency (an inverse of quality) of the computation is
shown to increase as the energy decreases, given a fixed computational
workload.
In Figure 12.6b, the quality of a FIR filtering algorithm is varied by scaling
the number of filter taps. As the number of taps—and hence the
computational workload—decreases, the processor can run at a lower clock
speed and, therefore, operate at a lower voltage. In each example, DVS-
based implementations of energy-quality tradeoffs consume up to 60% less
energy than a fixed-voltage processor. “Voltage scheduling” [20][21] by the
processor’s operating system enables the processor voltage to be varied
rapidly in response to a time-varying workload.
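A back-of-the-envelope sketch of the quadratic savings follows; the frequency-to-voltage pairs are hypothetical operating points in the spirit of Figure 12.5, not the measured SA-1100 characteristic:

```python
def switching_energy_per_op(c_sw, vdd):
    """Dominant (switching) energy per operation: quadratic in supply voltage."""
    return c_sw * vdd ** 2

# Hypothetical DVS operating points: clock frequency (MHz) -> required Vdd (V).
points = {206: 1.5, 162: 1.3, 103: 1.1, 59: 0.9}

C_SW = 1e-9  # hypothetical switched capacitance per operation
e_full = switching_energy_per_op(C_SW, points[206])
e_slow = switching_energy_per_op(C_SW, points[59])

# Relaxing latency lets the scheduler pick the slow point and save ~64%.
savings = 1 - e_slow / e_full
print(round(savings, 2))  # 0.64
```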

12.3.2 Ensembles of Systems

CMOS circuits become complete digital systems through the collaboration
of functional units such as adders, multipliers, and memory cells. The
following example considers the power awareness of a multiplication unit,
specifically the energy scalability over the bit width of its computation.
Multiplier circuits are typically designed for a fixed maximum operand
size, such as 64 bits per input. In practice, however, typical inputs to the
multiplier are far smaller. Calculating, for instance, an 8-bit multiplication
on a 64-bit multiplier can lead to serious energy inefficiencies due to
unnecessary digital switching on the high bits. The problem size of the
multiplication is a source of operational diversity, and large, monolithic
multiplier circuits are not sufficiently energy-scalable.
An architectural solution to input bit-width diversity is the incorporation
of additional, smaller multipliers of varying sizes, as illustrated in Figure
12.7. Incoming multiplications are routed to the smallest multiplier that can
compute the result, reducing the energy overhead of unused bits. An
ensemble of point systems, each of which is energy-efficient for a small
range of inputs, takes the place of a single system whose energy
consumption does not scale as gracefully with varying inputs. The size and
composition of the ensemble is an optimization problem that accounts for the
probabilistic distribution of the inputs and the energy overhead of routing
them [22]. In short, an ensemble of systems improves power-awareness for
digital architectures with a modest cost in chip area. As process technologies
continue to shrink digital circuits, this area trade-off will be increasingly
worthwhile.
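The routing step of such an ensemble can be sketched as follows; the particular widths (8/16/32/64 bits) and the routing rule are hypothetical choices, not the optimized composition discussed in [22]:

```python
def route_multiply(a, b, widths=(8, 16, 32, 64)):
    """Pick the smallest multiplier in the ensemble whose bit width covers
    both operands, so small products avoid switching a 64-bit datapath."""
    needed = max(a.bit_length(), b.bit_length(), 1)
    for w in widths:
        if needed <= w:
            return w  # width of the unit that would compute a * b
    raise ValueError("operands exceed the largest multiplier in the ensemble")

assert route_multiply(200, 3) == 8      # 8-bit operands -> 8-bit unit
assert route_multiply(70000, 2) == 32   # a 17-bit operand -> 32-bit unit
```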

12.3.3 Variable Radio Modulation

The modulation scheme used by the radio is another important trade-off that
can strongly impact the energy consumption of the node. As evidenced by
equation (12.2), one way to increase the energy efficiency of communication
is to reduce the transmission time of the radio. This can be accomplished by
sending multiple bits per symbol, that is, by using M-ary modulation. Using
M-ary modulation, however, will increase the circuit complexity and power
consumption of the radio. In addition, when M-ary modulation is used, the
efficiency of the power amplifier is also reduced. This implies that more
power will be needed to obtain reasonable levels of transmit output power.
The architecture of a generic binary modulation scheme is shown in
Figure 12.8(a), where the modulation circuitry is integrated together with
the frequency synthesizer [23] [17]. To transmit data using this architecture,
the VCO can be either directly or indirectly modulated. The architecture of a
radio that uses M-ary modulation is shown in Figure 12.8(b). Here, the data
encoder parallelizes serially input bits and then passes the result to a digital-
to-analog converter (DAC). The analog values produced serve as output
levels for the in-phase (I) and quadrature (Q) components of the output
signal.

The energy consumption for the binary modulation architecture can be
expressed as

  E_binary = (P_mod,B + P_syn,B + P_out,B)·T_on,B + P_syn,B·T_st      (12.3)

while the energy consumption for M-ary modulation is

  E_M-ary = (P_mod,M + P_syn,M + P_out,M)·T_on,M + P_syn,M·T_st       (12.4)

In these equations, P_mod,B and P_mod,M represent the power consumption of
the binary and M-ary modulation circuitry, P_syn,B and P_syn,M represent the
power consumed by the frequency synthesizer, P_out,B and P_out,M represent
the output transmit power for binary or M-ary modulation, T_on is the
transmit on-time, and T_st is the startup time. T_on for M-ary modulation is
less than for binary modulation for the same number of bits: T_on,M = T_on,B/b,
where b = log2(M) is the number of bits per symbol. The factors α_mod and
α_syn can be expressed as

  α_mod = P_mod,M / P_mod,B ,   α_syn = P_syn,M / P_syn,B             (12.5)

Here, α_mod represents the ratio of the power consumption of the modulation
circuitry between M-ary and binary modulation, while α_syn is the ratio of
synthesizer power between the M-ary and binary schemes. These parameters
represent the overhead that is added to the modulation and frequency
synthesizer circuitry when replacing a binary modulation scheme with an M-
ary modulation scheme.
Comparing (12.3) and (12.4) suggests that M-ary modulation achieves a
lower energy consumption when the following condition is satisfied:

  α_syn·P_syn,B·(T_on,B/b + T_st) + P_out,M·(T_on,B/b)
    < P_syn,B·(T_on,B + T_st) + P_out,B·T_on,B
      + P_mod,B·T_on,B - α_mod·P_mod,B·(T_on,B/b)                     (12.6)

The last two terms of equation (12.6) can be ignored since P_mod,B and P_mod,M
are negligible compared to the power of the frequency synthesizer. A
comparison of the energy consumption of binary modulation and M-ary
modulation is shown in Figure 12.9. In the figure, the ratio of the energy
consumption of M-ary modulation to the energy consumption of binary
modulation is plotted versus the overhead α_syn.
In Figure 12.9, M is varied to produce different M-ary modulation
schemes. For each scheme, the start-up time is also varied. 100-bit packets
are sent at 1 Mbps. In an M-ary scheme, 1/b megasymbols are sent per
second, and the on time T_on is decreased by a factor of b.
As expected, the M-ary modulation scheme achieves the lowest energy
when the overhead α_syn is small and the start-up time T_st is short. When the
start-up time is about 200 μs, however, the energy consumption is higher for
M-ary modulation regardless of α_syn. This is because the energy consumption
due to the start-up time dominates the energy consumption due to the
transmitter’s on time. Hence, reducing T_on by using a larger M has a
negligible effect on the total energy consumption when T_st is high.
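The crossover behavior in Figure 12.9 can be reproduced with a small sketch of the binary-versus-M-ary comparison; every power value and overhead factor below is hypothetical, chosen only to show how the start-up term flips the outcome:

```python
def tx_energy(p_mod, p_syn, p_out, t_on, t_st):
    """Transmit energy: all circuits during the on-time, synthesizer during start-up."""
    return (p_mod + p_syn + p_out) * t_on + p_syn * t_st

# Hypothetical binary baseline (watts); 100-bit packets at 1 Mbps.
P_MOD_B, P_SYN_B, P_OUT = 1e-3, 10e-3, 1e-3
T_ON_B = 100 / 1e6  # 100 us on-time for binary modulation

def energy_ratio(b, alpha_syn, t_st, alpha_mod=1.5):
    """E_M-ary / E_binary for M = 2**b with synthesizer overhead alpha_syn."""
    e_b = tx_energy(P_MOD_B, P_SYN_B, P_OUT, T_ON_B, t_st)
    e_m = tx_energy(alpha_mod * P_MOD_B, alpha_syn * P_SYN_B, P_OUT,
                    T_ON_B / b, t_st)  # on-time shrinks by b bits/symbol
    return e_m / e_b

# Short start-up: 4-ary (b = 2) wins despite a 40% synthesizer overhead.
assert energy_ratio(b=2, alpha_syn=1.4, t_st=20e-6) < 1
# Start-up near 200 us: the start-up term dominates and M-ary loses.
assert energy_ratio(b=2, alpha_syn=1.4, t_st=200e-6) > 1
```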

12.3.4 Adaptive Forward Error Correction

In any protocol stack, the link layer has a variety of purposes. One of the
tasks of the link layer is to specify the encodings and length limits on
packets such that messages can be sent and received by the underlying
physical layer. The link layer is also responsible for ensuring reliable data
transfer. This section discusses the impact of variable-strength error control
on the energy consumption of a microsensor node. An additional and similar
exploration of the impact of adapting packet size and error control on system
energy efficiency is available in [24].
The level of reliability provided by the link layer will depend on the
needs of the application and on user-specified constraints. In many wireless
sensor networks, such as machine monitoring and vehicle detection
networks, the actual data will need to be transferred with an extremely low
probability of error.
In a microsensor application, it is assumed that objects of interest have
high mobility (e.g., moving vehicles) and nodes are immobile. Thus, the
coherence time of the channel is not much larger than the signaling time.
Given this scenario, the nodes can be assumed to be communicating
over a frequency non-selective, slow Rayleigh fading channel with additive
white Gaussian noise. This is a reasonable channel model to use for
communication at 2.4 GHz where line-of-sight communication is not always
possible.
Consider one node transmitting data to another over such a channel using
the radio described in Section 12.2.3 . The radio presented uses non-coherent
binary frequency-shift keying (FSK) as the modulation scheme. For
purposes of comparison, the best achievable probability of error using raw,
non-coherent binary FSK over a slowly fading Rayleigh channel will be
presented. Let P_b be a function of the received energy per bit to noise power
ratio, γ_b = E_b/N_0. In general, P_b = (1/2)·exp(−γ_b/2), where γ_b is a random
variable for a fading channel. It is shown in [25] that the probability of error
using non-coherent, orthogonal binary FSK is P_b = 1/(2 + γ̄_b), where γ̄_b is
the average E_b/N_0.
Unfortunately, this does not directly tell us the amount of transmit power
that is required to achieve a certain probability of error. Determining P_b as a
function of the transmit power P_t requires consideration of the radio
implementation. In general, the average E_b/N_0, γ̄_b, can be converted to P_t using

P_t = γ̄_b · N_0 · W · F · (L_p / ᾱ)

where L_p represents the large-scale path loss, ᾱ is the average attenuation
factor due to fading, W is the signal bandwidth (the data rate R is assumed
equal to W), N_0 is the thermal noise spectral density, and F is the excess
noise contributed by the receiver circuitry, known as the noise figure.
A conservative estimate for L_p is about 70 dB. For a radio with signal
bandwidth W = 1 MHz and a data rate of 1 Mbps, this equation can
be used to find the transmit power needed to obtain a certain average γ̄_b.
The uncoded curve in Figure 12.10 shows the probability of error plotted
against the output power of the transmitter.
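For non-coherent, orthogonal binary FSK over a slow Rayleigh fading channel, the averaged error probability takes the closed form P̄_b = 1/(2 + γ̄_b), where γ̄_b is the average received E_b/N_0 [25]. A short numerical sketch (the helper names and the sample targets below are ours, not the chapter's) makes the key point visible: under fading, P_b falls only linearly with SNR, so each decade of error-rate improvement costs roughly 10 dB of transmit power.

```python
import math

def pb_fsk_rayleigh(avg_snr):
    """Average bit-error probability of non-coherent, orthogonal binary
    FSK over a slow Rayleigh fading channel: Pb = 1 / (2 + avg_snr),
    with avg_snr the average received Eb/N0 (linear, not dB)."""
    return 1.0 / (2.0 + avg_snr)

def required_avg_snr(target_pb):
    """Invert the expression: average Eb/N0 needed to hit a target Pb."""
    return 1.0 / target_pb - 2.0

if __name__ == "__main__":
    for pb in (1e-2, 1e-3, 1e-4):
        snr = required_avg_snr(pb)
        print(f"Pb = {pb:.0e}: avg Eb/N0 = {snr:8.0f} ({10 * math.log10(snr):5.1f} dB)")
```

Without fading, the same modulation enjoys an exponential error decay; the fading channel's linear decay is what makes raw (uncoded) FSK so expensive and motivates the FEC discussion that follows.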
Since using a power amplifier alone is highly inefficient, forward error
correction (FEC) can be applied to the data to decrease the probability of
error. Many types of error-correcting codes can be used to reduce the
probability of bit error. The following discussion considers convolutional
codes with base coding rates of 1/2 and 1/3 and their punctured derivatives.
For a frequency non-selective, Rayleigh fading channel, a bound on P_b can be
determined by applying

P_b < Σ_{d = d_free}^{∞} β_d · P(d)

Here d represents the Hamming distance between some path in the trellis
decoder and the all-zero path, the coefficients β_d can be obtained from the
expansion of the first derivative of the transfer function, P(d) is the first-
event error probability, and d_free is the minimum free distance [25]. Figure
12.10 plots P_b for codes with varying rates and constraint lengths K.
Power Awareness Through Energy Scalability 351

The probabilities shown assume the use of a hard-decision Viterbi decoder at
the receiver. Greater redundancy (a lower rate) or more memory (a higher
constraint length) lowers the output power needed for a given P_b. From this
perspective, coding should always be used.
While FEC decreases the transmit power needed to achieve a given error
rate, the additional processing required will increase the energy of computa-
tion. Depending on the underlying architecture, this energy cost can be
significant. Additional processing energy must be expended in order to
encode and decode the data. An additional energy cost will also be incurred
during the communication of the message, since encoding a bit stream with a
rate-r code increases the size of the packet by approximately a factor of 1/r,
thereby increasing the radio on-time and the radio energy required to transmit
a packet. Denoting the energy to encode as E_enc and the energy to decode as
E_dec, the total energy cost of the communication can be derived from
equation (12.2) as

E_total = E_tx + E_rx + E_enc + E_dec                    (12.9)

where E_tx and E_rx are the radio energies of equation (12.2) evaluated for the
lengthened, encoded packet.

Given this model, it is possible to derive the average energy to transmit,
receive, encode, and decode each information bit. If r is the code rate and L
is the packet length transmitted, then the number of information bits is r·L.
Thus, the energy per useful bit is the total energy divided by r·L. In general, for
convolutional codes, the energy required to encode data is negligible. Viterbi
decoding, on the other hand, can be energy-intensive, depending on the
implementation. Therefore, two very different implementations are considered:
a C program on the SA-1100 processor and a dedicated VLSI application-specific
circuit. These choices represent the two extremes of energy consumption.

Figure 12.11 plots the measured energy per useful bit required to
decode 1/2- and 1/3-rate convolutional codes with varying constraint length
on the SA-1100. Two observations can be derived from these graphs. First,
the energy consumption scales exponentially with the constraint length. This
is to be expected since the number of states in the trellis increases
exponentially with constraint length. Second, the energy consumption
appears independent of the coding rate. This is reasonable since the rate only
affects the number of bits sent over the transmission. A lower-rate code does
not necessarily increase the computational energy since the number of states
in the Viterbi decoder is unaffected. In addition, the cost of reading the data
from memory is dominated by the updating of the survivor path registers in
the Viterbi algorithm. The size of the registers is proportional to the
constraint length and is not determined by the rate. Therefore, given two
convolutional codes C1 and C2, both with constraint length K, where C1 has
the lower rate, the per-bit energy to decode C1 and C2 is the same even though
more bits are transmitted when using C1.
Given the data in Figure 12.11, the convolutional code that minimizes
the energy consumed by communication can be determined for a given
probability of error P_b. In Figure 12.12, the total energy per information bit
is plotted against P_b.

Figure 12.12 shows that, at high probabilities of error, the energy per bit
using no coding is lower than that with coding. The reason for this result is
that the energy of computation, i.e., decoding, dominates the energy used by
the radio at high probabilities of error. For example, assuming the model
described in equation (12.9), the communication energy to transmit and
receive per useful bit for one of the coded cases is 85 nJ/bit, while the
energy to decode the same code on the SA-1100 is measured to be
2200 nJ per bit.
At lower probabilities of error, the power amplifier energy begins to
dominate. At these ranges, codes with greater redundancy have better
performance. These results imply that coding the data is not always the best
operational policy for energy-efficient operation. While it may appear that
this is solely due to the inefficiency of the SA-1100 in performing
error-correction decoding, the result holds even for more efficient
implementations of Viterbi decoding.
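The tradeoff can be captured in a few lines. In the sketch below, the 85 nJ/bit and 2200 nJ/bit figures are the ones quoted above; the uncoded radio multiplier and the anchor constraint length K_anchor are illustrative assumptions, not the chapter's measured values.

```python
def total_energy_per_useful_bit(e_comm, e_dec):
    """Energy per information bit: communication energy (transmit plus
    receive, already expressed per useful bit) plus decode energy."""
    return e_comm + e_dec

def decode_energy_per_bit(K, e_anchor=2200e-9, K_anchor=7):
    """Illustrative model: Viterbi decode energy doubles with each unit
    of constraint length (trellis states scale as 2**K) and is
    independent of the code rate. The anchor point is hypothetical."""
    return e_anchor * 2.0 ** (K - K_anchor)

if __name__ == "__main__":
    e_comm_coded = 85e-9   # coded communication energy quoted in the text
    coded = total_energy_per_useful_bit(e_comm_coded,
                                        decode_energy_per_bit(7))
    # Suppose the uncoded link must spend 5x the radio energy to reach
    # the same Pb (illustrative), but pays nothing to decode:
    uncoded = total_energy_per_useful_bit(5 * e_comm_coded, 0.0)
    print(f"coded  : {coded * 1e9:7.0f} nJ per useful bit")
    print(f"uncoded: {uncoded * 1e9:7.0f} nJ per useful bit")
```

On the SA-1100 the decode term swamps the radio term at high error rates, reproducing the conclusion above; a decoder with a much smaller e_anchor (such as the ASIC of the next paragraphs) flips the comparison at low error rates.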
Since using the SA-1100 to perform Viterbi decoding is energy-inefficient,
a dedicated integrated circuit solution for decoding is preferred. To explore
the power characteristics of dedicated Viterbi decoders, 1/2-rate decoders
with different constraint lengths are synthesized in a TSMC ASIC
(application-specific integrated circuit) technology. These designs are fully
parallel implementations of the Viterbi algorithm with a separate
add-compare-select (ACS) unit for each state. Using Synopsys Power
Compiler, the energy per bit used by these designs is estimated for the
decoding of 20,000 bits. Figure 12.13 shows the energy per bit for various
constraint lengths.
Using the ASIC implementation in conjunction with the present radio model,
the minimum-energy code for a given probability of error can again be
determined. In Figure 12.14, the energy per useful bit is plotted against P_b.

The least-energy communication/computation scheme depends on the
probability of error desired at the receiver. At high probabilities of error,
no coding should be used, for the transceiver power is dominant there;
since coding the data will increase the on-time of the transceiver,
coding increases the overall energy per useful bit. Once the target
probability of error is sufficiently low, the overall communication energy
with coding is lower, since the energy of the power amplifier will begin
to dominate.
Figure 12.14 reinforces the idea that coding the data may not necessarily
be the best solution if energy efficiency is a criterion. Indeed, the coding
strategy should be one that enables energy scalability with the output
quality desired by the user.

12.4 POWER-AWARE COMMUNICATION

This section moves upward in the protocol stack to consider the design of
power-aware medium access control (MAC) layers and routing protocols. For
maximal energy efficiency, the operational policies of the MAC and routing
protocols must be tailored to the energy consumption characteristics of the
hardware and the nature of the sensing application.

12.4.1 Low-Power Media Access Control Protocol

The design of an energy-efficient media access control layer must account
for the non-ideal behaviors of the physical layer. This section explores the
design of a MAC for a factory machine-monitoring application where a
centralized base station gathers data from sensors that are spread over a
small region within 10 meters.
In Section 12.2.3, the parameters denoting the frequency of communication
were stated to depend on the underlying MAC protocol. At the MAC level,
these parameters depend largely on the latency requirement specified by
the user.
The MAC protocols considered are limited to time division multiple
access (TDMA) and frequency division multiple access (FDMA). Other
multi-access schemes, such as code division multiple access (CDMA), result
in excessive system complexity for a microsensor application. Contention-
based schemes (e.g., Aloha and CSMA) are also ruled out due to the high
cost of receiving acknowledgment signals; moreover, the latency of a packet
cannot be guaranteed.
In a TDMA scheme, the full bandwidth of the channel is dedicated to a
single sensor for communication purposes. Thus, the signal bandwidth per
sensor is equal to the available bandwidth, and sensors can transmit at the
highest data rate. Since the transmission on-time of the radio model
described in equation (12.2) is inversely proportional to the signal
bandwidth, the on-time is minimized in TDMA schemes. On the other hand,
in an FDMA scheme, the signal bandwidth (the total available bandwidth
divided by the number of sensors) is minimal, so the on-time is at its maximum.

A hybrid scheme involving both TDMA and FDMA (TDM-FDM) is also
possible. In a TDM-FDM scheme, both time and frequency are divided into
available transmission slots. Figure 12.15 illustrates each of the different
multiple-access schemes considered, where a shaded area indicates a valid
transmission slot for a given sensor.
In the schemes where TDM is employed, a downlink from the base
station to the sensors is required to maintain time synchronization among the
nodes in the network. Due to the finite error in each sensor's reference
clock, the base station must send synchronization (SYNC) packets to avoid
collisions among transmitted packets. Hence, the receiver circuitry of each
sensor must be activated periodically to receive the SYNC signals. As
explained in Section 12.2.3, the receiver consumes more power than the
transmitter. Thus, it is necessary to reduce the average number of times the
receiver is active.
The number of times the receiver needs to be active depends on the
guard time t_g, the minimum time difference between two time slots in the
same frequency band, as shown in Figure 12.15. During t_g, no sensor is
scheduled to transmit any data. Thus, a larger guard time will reduce the
probability of packet collisions and, in turn, reduce the frequency of SYNC
signals and receiver wake-ups.

If two slots in the same frequency band are separated by t_g, it will
take t_g/(2δ) seconds for these two packets to collide, where δ is the percent
difference between the two sensors' clocks. Hence the sensors must be
resynchronized at least 2δ/t_g times every second. In other
words, the average number of times the receiver is active per second can be
written as N_sync = 2δ/t_g. Assuming that the total slot time available within
one latency period is divided among the slots and their guard intervals, a
formula can be derived relating t_g to the latency requirement of the
transmitted packet, as follows:

t_g = (h/M) · (T_l − M·L/W)                    (12.10)

where W is the available bandwidth, L is the length of the data packet in
bits, T_l is the latency requirement of the transmitted packet, M is the
number of sensors, and h is the number of channels in the given band W. The
data rate R is assumed to equal the signal bandwidth, such that R = W/h.

From equation (12.10), it is apparent that as the number of time slots per
channel decreases (that is, as h grows), the guard time increases and the
synchronization frequency decreases. Moreover, the advantage of ideal
FDMA is that a receiver at the sensor is not needed at all (i.e., N_sync = 0).
Substituting equation (12.10) into equation (12.2) yields an analytical
formula for the optimum number of channels that achieves the lowest
power consumption.

The value of the optimum number of channels, h_opt, is determined by the
ratio of the power consumption of the transmitter to that of the receiver. As
expected, receivers that consume less power favor TDMA, while receivers
with larger power consumption favor FDMA.
As an example, the above results are applied to a scenario where a sensor
sends, on average, twenty 100-bit packets per second (L = 100 bits) with a
5 ms latency requirement (T_l = 5 ms). The available bandwidth
is 10 MHz (W = 10 MHz), and the number of sensors in a cell is 300
(M = 300). The resulting average power consumption is plotted in Figure
12.16 and Figure 12.17, where the horizontal axis is the number of channels
available (h = 1 for TDMA, h = 300 for FDMA), and the vertical axis
represents the average power consumption. In Figure 12.16 the average
power consumption is plotted for various start-up times up to 1 ms. The
average power is at a minimum when a hybrid TDM-FDM scheme is used.
The variation in power consumption for different h decreases as the start-up
time increases, since the overall power consumption becomes dominated by
the start-up transient. Figure 12.17 illustrates how the power consumption
curve would change if a different radio receiver were used: the receive power
is varied while the transmit power is held constant. As the receiver power
increases, the optimum number of channels h_opt increases. Again, TDMA
does not achieve the lowest power, despite its minimum on-time, because of
the receive energy required for synchronization. As the number of TDM slots
is reduced (perhaps through the inclusion of FDM channels), the guard time
increases and synchronization is required less frequently.
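The guard-time bookkeeping above is easy to sandbox. The sketch below uses our own slot accounting (each channel carries M/h slots of duration L/(W/h) within the latency period, with the leftover time split into equal guard intervals), which may differ in detail from the chapter's equation (12.10); the clock-error value is a hypothetical 50 ppm.

```python
def guard_time(h, M, L, W, T_l):
    """Guard time per slot: per channel, M/h slots of duration L/(W/h)
    must fit in T_l; the remaining time is split into M/h guards."""
    slots_per_channel = M / h
    transmit_time = slots_per_channel * (L / (W / h))   # equals M*L/W
    return (T_l - transmit_time) / slots_per_channel

def sync_rate(t_guard, delta):
    """Receiver wake-ups per second: two clocks drifting apart at a
    relative rate 2*delta close a gap of t_guard in t_guard/(2*delta)
    seconds, so resynchronize 2*delta/t_guard times per second."""
    return 2.0 * delta / t_guard

if __name__ == "__main__":
    M, L, W, T_l = 300, 100, 10e6, 5e-3   # scenario from the text
    delta = 50e-6                          # hypothetical 50 ppm clock error
    for h in (1, 10, 100, 300):
        tg = guard_time(h, M, L, W, T_l)
        print(f"h={h:3d}: guard = {tg * 1e6:8.2f} us, "
              f"SYNCs/s = {sync_rate(tg, delta):7.2f}")
```

Moving from pure TDMA (h = 1) toward FDMA (h = M) stretches the guard time and drives the synchronization (and hence receive) overhead toward zero, which is the effect visible in Figures 12.16 and 12.17.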

12.4.2 Minimum Energy Multihop Forwarding

Data in a sensor network are subject to two primary operations: the
forwarding of data to a remote base station and the aggregation of multiple
streams into a single, higher-quality stream. This section considers an
energy-efficient approach for performing the first of these two essential
functions.
Multihop forwarding utilizes several intervening nodes acting as relays to
prevent any node from having to spend too much transmit energy. A scheme
that transports data between two nodes such that the overall rate of energy
dissipation is minimized is called a minimum energy relay. The proper
placement of nodes for minimum energy relay can be derived by considering
the energy required for a generalized multihop relay.
To aid the presentation of the analysis, the total energy required to
transmit and receive a packet of data over a distance d is represented as follows:

E(d) = E_te + E_ta(d) + E_re

where E_te is the energy per packet consumed by the transmitter
electronics (including energy costs of imperfect duty cycling due to finite
start-up time), E_ta(d) = α2·d^n accounts for energy dissipated in the transmit
amplifier (including amplifier inefficiencies) over a path with loss index n,
and E_re is the energy per packet consumed by the receiver electronics. This
condensed model follows from the radio model in equation (12.2), for E_te,
E_ta, and E_re above correspond to terms of equation (12.2). The path loss
term in equation (12.3), however, assumes a variable-power amplifier that is
not considered in equation (12.2).

With E_te, E_ta, and E_re defined above, the energy consumed per second (i.e.,
power) by a node acting as a relay that receives data and then transmits it d
meters onward is

P_relay(d) = r·(α1 + α2·d^n)

where α1 = E_te + E_re and r is the number of packets relayed per second (the
relay rate).

Figure 12.18 depicts a multihop relay scenario between nodes A and B.
If K − 1 relays are introduced between A and B, which are separated by a
distance D, then the overall rate of dissipation is defined as

P_relay(D, K) = r·( K·(α1 + α2·(D/K)^n) − E_re )

The term −r·E_re accounts for the fact that node A, the initiator of the relay,
need not spend any energy receiving. The receive energy needed at B is
disregarded because it is fixed regardless of the number of intervening
relays.

To achieve minimum energy relay, then, two conditions are required.
First, the total relay power P_relay(D, K) is minimized when all the hop
distances are made equal to D/K. (Since d^n is strictly convex in d, this can be
proved directly from Jensen’s inequality.) The minimum energy relay for a
given distance D, then, has either no intervening hops or equidistant
hops whose number is completely determined by D.

Second, the optimal number of hops K_opt is always one of

K_opt = ⌊D/d_char⌋  or  ⌈D/d_char⌉

where the distance d_char, called the characteristic distance, is independent of
D and given by

d_char = ( α1 / (α2·(n − 1)) )^(1/n)

This result follows directly from the optimization of P_relay(D, K) with
respect to K [26]. By substituting the optimal number of hops into the
expression for P_relay(D, K), the energy dissipation rate of relaying a packet
over distance D can be bounded by

P_relay(D) ≥ r·( (n/(n − 1))·α1·(D/d_char) − E_re )

with equality if and only if D is an integral multiple of d_char.


These results can be summarized in three points. First, for any loss index
n, the energy costs of transmitting a bit can always be made linear with dis-
tance. Second, for any given distance D, there is a certain optimal number
of intervening nodes acting as relays; using more or fewer than this optimal
number leads to energy inefficiencies. Third, the most energy-efficient relays
result when D is an integral multiple of the characteristic distance.
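These three points translate directly into code. In the sketch below, the radio constants α1, α2 and the loss index n are hypothetical placeholders chosen only to exercise the formulas:

```python
import math

def relay_power(D, K, r, a1, a2, n, E_re):
    """Overall dissipation rate for K equidistant hops covering total
    distance D: P = r * (K*(a1 + a2*(D/K)**n) - E_re)."""
    return r * (K * (a1 + a2 * (D / K) ** n) - E_re)

def characteristic_distance(a1, a2, n):
    """d_char = (a1 / (a2 * (n - 1))) ** (1/n): the hop length that
    minimizes energy per meter, independent of the total distance D."""
    return (a1 / (a2 * (n - 1))) ** (1.0 / n)

def optimal_hops(D, a1, a2, n):
    """Optimal hop count: floor or ceil of D/d_char, whichever is cheaper."""
    d_char = characteristic_distance(a1, a2, n)
    candidates = {max(1, math.floor(D / d_char)), math.ceil(D / d_char)}
    return min(candidates,
               key=lambda K: relay_power(D, K, 1.0, a1, a2, n, 0.0))

if __name__ == "__main__":
    a1, a2, n = 180e-9, 10e-12, 2     # hypothetical radio constants
    d_char = characteristic_distance(a1, a2, n)
    D = 500.0
    K = optimal_hops(D, a1, a2, n)
    print(f"d_char = {d_char:.1f} m; optimal hops over {D:.0f} m: K = {K}")
```

With these constants d_char is about 134 m, so a 500 m link is best covered in four hops; forcing three (or five) hops strictly raises the dissipation rate.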

12.4.3 Clustering and Aggregation

An alternative to multihop routing with special advantages for microsensor
networks is clustering. An example specifically developed for microsensor
applications is LEACH (Low Energy Adaptive Clustering Hierarchy).
LEACH is a cluster-based protocol in which groups of adjacent nodes
organize themselves into ad hoc clusters. One node in each cluster, the
cluster head, receives data from the other nodes in the cluster and forwards
the combined data from the entire cluster to the base station in a single
long-distance transmission. New cluster heads (and new clusters as well) are
chosen at periodic intervals to rotate this energy-intensive role among the
nodes [27].
Cluster heads are a natural focal point for data aggregation, the fusion of
multiple streams of correlated data inputs into a single, high-quality output.
A class of algorithms known as beamforming algorithms can perform this
aggregation, either with or without knowledge of the nodes’ locations [28].
As environmental observations from adjacent nodes in the cluster are likely
to be highly correlated, aggregation is an essential collaborative tool to
reduce the number of bits transmitted over wireless links.
To illustrate clustering and aggregation, suppose a vehicle is moving over
a region where a network of acoustic sensing nodes has been deployed. To
determine the location of the vehicle, the line of bearing (LOB) to the
vehicle is found. In this scenario, the nodes are assumed to be clustered as
depicted in Figure 12.19 and multiple clusters autonomously determine the
source's LOB from their perspective. Individual sensors send data to a
clusterhead. The intersection of multiple LOBs determines the source's
location and can be calculated at the base station.
One approach to locate the source is to first estimate the LOB at the
cluster and then to transmit the result to the base station. Alternatively, all
the sensors could transmit their raw, collected data directly to the base

station for processing. Figure 12.20 depicts the energy required for the first
approach compared to the energy required for the second approach. As the
distance from the sensor to the base station increases, it is more energy-
efficient to perform signal processing locally, at the sensor cluster.
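A toy energy model shows where the crossover in Figure 12.20 comes from. Every constant and payload size below is an illustrative assumption (a first-order radio with energy bits·(e_elec + e_amp·d^n) and a flat per-bit processing cost), not a value from the chapter:

```python
def tx_energy(bits, d, e_elec=50e-9, e_amp=10e-12, n=2):
    """First-order radio model: energy = bits * (e_elec + e_amp * d**n)."""
    return bits * (e_elec + e_amp * d ** n)

def raw_approach(raw_bits, d):
    """Ship all raw samples to the base station; no local processing."""
    return tx_energy(raw_bits, d)

def local_approach(raw_bits, lob_bits, d, e_proc_per_bit=150e-9):
    """Run the FFTs and beamforming locally (modeled as a flat per-bit
    cost) and transmit only the short LOB result."""
    return raw_bits * e_proc_per_bit + tx_energy(lob_bits, d)

if __name__ == "__main__":
    raw_bits, lob_bits = 12 * 1024, 32   # hypothetical payload sizes
    for d in (10, 50, 100, 500):
        print(f"d = {d:3d} m: raw = {raw_approach(raw_bits, d) * 1e6:8.1f} uJ, "
              f"local = {local_approach(raw_bits, lob_bits, d) * 1e6:8.1f} uJ")
```

With these constants the two curves cross near d ≈ 100 m: below that, shipping raw data is cheaper; beyond it, the d^n amplifier term makes local aggregation win, mirroring the trend in Figure 12.20.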

12.4.4 Distributed Processing through System Partitioning

Within a sensor cluster performing data aggregation, energy consumption
can be further reduced by adapting to the underlying hardware parameters of
the node. By distributing a computation across the network, a cluster of
nodes can exploit parallelism to improve the energy efficiency of the
algorithm.
The line-of-bearing (LOB) estimation procedure discussed in the
previous section can be performed through frequency-domain delay-and-sum
beamforming [29],[28]. Beamforming is the act of summing the outputs of
filtered sensor inputs. In a simple delay-and-sum beamformer, the filtering
operations are delays or phase shifts. The first part of frequency-domain
beamforming is to transform collected data from each sensor into the
frequency domain using a 1024-point Fast Fourier Transform (FFT). Then,
the FFT data is beamformed in twelve uniform directions to produce twelve
candidate signals. The direction of the signal with the most energy is the
LOB of the source. In this application, user requirements will be assumed to
impose a latency constraint of 20 ms on the computation.

This source localization algorithm can be implemented in two different
ways. Assume each sensor i has a set of acoustic data. This data can be
sent first to a local aggregator or clusterhead, where all FFT and beamforming
operations are performed. This direct technique is illustrated in Figure
12.21(a). Alternatively, each sensor can transform the data locally before
sending it to the clusterhead. This distributed technique is illustrated
in Figure 12.21(b). Assuming the radio and processor models discussed in
Section 12.1, performing the FFTs locally, while distributing the
computational load and reducing latency, has no energy advantage. This is
because performing the FFTs locally does not reduce the amount of data that
needs to be transmitted. Thus, communication costs remain the same.
However, on hardware that supports dynamic voltage scaling (Section
12.3.1 ), the network can take advantage of the parallelized computational
load by allowing voltage and frequency to be scaled while still meeting
latency constraints. In a DVS-enabled system, there is an advantage to
distributed signal processing. By distributing computation, the clock rate can
be reduced at each sensor, allowing for a reduction in supply voltage. In
System Partition 1, the direct technique, all sensors sense data and transmit
their raw data to the clusterhead, where the FFTs and beamforming are
executed. The clusterhead performs the beamforming and LOB estimation
before transmitting the result back to the user. In order to be within the user's
latency requirement of 20 ms, all of the computation is done at the
clusterhead at the fastest clock speed, f = 206 MHz at 1.44 V. The energy
dissipated by the computation is 6.2 mJ, and the latency is 19.2 ms.
In System Partition 2, the distributed technique, the FFT task is
parallelized. In this scheme, the sensor nodes perform the 1024-point FFTs
on the data before transmitting the data to the clusterhead. The clusterhead
performs the beamforming and LOB estimation. Since the FFTs are
parallelized, the clock speed and voltage supply of both the FFTs and the
beamforming can be lowered. For example, if the FFTs at the sensor nodes
are run at 0.85 V at 74 MHz while the beamforming algorithm is run at
1.17 V at 162 MHz, then, with a latency of 18.4 ms, only 3.4 mJ is
dissipated. This is a 45.2% improvement in energy dissipation. This example
shows that efficient system partitioning by parallelism can yield large
energy reductions.
Figure 12.22 compares the energy dissipated by System Partition 1
versus System Partition 2 with optimal voltage scheduling as the
number of sensors is increased from 3 to 10. The plot shows that a
30-65% energy reduction can be achieved with the system-partitioning
scheme. Therefore, protocol designers should consider DVS coupled with
computational system partitioning when designing algorithms for sensor
networks.
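The numbers above follow a simple first-order model E = C_eff·V²·(cycles) with latency = cycles/f. In the sketch below, the voltage/frequency operating points are the ones quoted in the text, while the cycle counts, node count, and effective capacitance C_eff are hypothetical values chosen only to land in the same ballpark:

```python
def energy(cycles, v, c_eff=1.0e-9):
    """First-order switching-energy model: E = C_eff * V**2 * cycles.
    C_eff is an illustrative effective capacitance per cycle."""
    return c_eff * v * v * cycles

def latency(cycles, f_hz):
    """Execution time at clock frequency f_hz."""
    return cycles / f_hz

if __name__ == "__main__":
    # Hypothetical cycle counts: 7 sensors' 1024-point FFTs plus beamforming.
    fft_cycles, bf_cycles, nodes = 4.0e5, 1.1e6, 7

    # Partition 1 (direct): the clusterhead does everything at 206 MHz, 1.44 V.
    e1 = energy(nodes * fft_cycles + bf_cycles, 1.44)
    t1 = latency(nodes * fft_cycles + bf_cycles, 206e6)

    # Partition 2 (distributed): each node runs its FFT in parallel at
    # 74 MHz, 0.85 V; the clusterhead beamforms at 162 MHz, 1.17 V.
    e2 = nodes * energy(fft_cycles, 0.85) + energy(bf_cycles, 1.17)
    t2 = latency(fft_cycles, 74e6) + latency(bf_cycles, 162e6)

    print(f"direct     : {e1 * 1e3:.2f} mJ in {t1 * 1e3:.1f} ms")
    print(f"distributed: {e2 * 1e3:.2f} mJ in {t2 * 1e3:.1f} ms")
```

Because the parallelized FFTs run at a much lower voltage, the distributed partition dissipates roughly half the energy while both partitions stay inside the 20 ms deadline, matching the shape of the chapter's result.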

12.5 NODE PROTOTYPING

The µAMPS (micro-Adaptive Multi-domain Power-aware Sensors) node is a
wireless sensor node that exposes the underlying parameters of the physical
hardware to the system designer. A complete prototype node, it has
the ability to scale the energy consumption of the entire system in response
to changes in the environment, the state of the network, and protocol and
application parameters in order to maximize system lifetime and reduce
global energy consumption. Thus, all layers of the system, including the
algorithms, operating system, and network protocols, can adaptively
minimize energy usage.

12.5.1 Hardware Architecture

Figure 12.23 provides an overview of the sensor node architecture. Each
architectural block exhibits certain energy dissipation properties, from
leakage currents in the integrated circuits to the output quality and latency
requirements of the user. As a result, the energy consumption of every
component in the system can be exploited at the software level [8] to extend
system lifetime and meet user constraints. Figure 12.24 shows the node
implemented in actual hardware. Figure 12.25 illustrates the assembly of
node boards into a base station.
The sensing subsystem consists of a sensor connected to an analog-to-
digital (A/D) converter. The initial node contains an electret microphone for
acoustic sensing. However, a wider variety of sensors are supported. The
acoustic sensor is connected to a 12-bit A/D converter capable of converting
data at a rate of 125 kilosamples per second (kSPS). In the vehicle tracking
application, the required conversion rate is about 1 kSPS. An envelope
detector is also included to allow ultra-low energy sensing.

The primary component of the data and control processing subsystem is the
StrongARM SA-1110 microprocessor. Selected for its low power
consumption, performance, and static CMOS design, the SA-1110 runs at a
clock speed of 59 MHz to 206 MHz. The processing subsystem also includes
RAM and flash ROM for data and program storage. A multi-threaded
operating system running on the SA-1110 has been customized to allow
software to scale the energy consumption of the processor. Code for the
algorithms and protocols is stored in ROM.
Data from the StrongARM that is destined for neighboring nodes is
passed to the radio subsystem of the node via a 16-bit memory interface. A
Xilinx FPGA performs additional protocol processing and data recovery.
The primary component of the radio is a Bluetooth-compatible commercial
single-chip 2.4 GHz transceiver [17] with an integrated frequency
synthesizer. The on-board phase-locked loop (PLL), transmitter chain, and
receiver chain can be shut off via software or hardware control for energy
savings. To transmit data, an external voltage-controlled oscillator (VCO) is
directly modulated, providing simplicity at the circuit level and reduced
power consumption at the expense of limits on the amount of data that can
be transmitted continuously. The radio module, with two different power
amplifiers, is capable of transmitting at 1 Mbps at a range of up to 100 m.
Finally, power for the node is provided by the battery subsystem via a
single 3.6 V DC source with a capacity of approximately 1500 mAh.
Switching regulators generate 3.3 V and adjustable 0.9-2.0 V supplies
from the battery. The 3.3 V supply powers all digital components on the
sensor node with the exception of the processor core. The core is specially
powered by a digitally adjustable switching regulator that can provide 0.9 V
to 2.0 V in thirty discrete increments. The digitally-adjustable voltage allows
the SA-1110 to control its own core voltage, enabling the use of the dynamic
voltage scaling technique discussed in Section 12.3.1 . This feedback loop
governing processor voltage is illustrated in Figure 12.26.
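The effect of this feedback loop can be sketched as a table lookup: given the clock rate an application deadline demands, pick the lowest regulator setting that supports it. The operating points below mix values quoted in this chapter (0.85 V/74 MHz, 1.17 V/162 MHz, 1.44 V/206 MHz) with one invented intermediate point; the real regulator offers thirty discrete steps.

```python
def choose_core_voltage(required_f_hz, opp_table):
    """Return the lowest core voltage whose maximum supported clock
    frequency meets the requirement; saturate at the top setting."""
    for v, f_max in sorted(opp_table.items()):
        if f_max >= required_f_hz:
            return v
    return max(opp_table)

# Hypothetical operating-point table: core voltage -> max frequency (Hz).
OPPS = {0.85: 74e6, 1.00: 118e6, 1.17: 162e6, 1.44: 206e6}

if __name__ == "__main__":
    cycles, deadline = 2.0e6, 15e-3          # hypothetical workload
    f_req = cycles / deadline                # about 133 MHz needed
    print(f"required {f_req / 1e6:.0f} MHz -> run at "
          f"{choose_core_voltage(f_req, OPPS)} V")
```

Since dynamic energy scales with V², dropping from the top setting to the lowest feasible one is where the DVS savings of Section 12.3.1 come from.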

12.5.2 Measured Energy Consumption

Energy consumption measurements confirm the system’s energy scalability.
Peak power consumption, with all components at their peak activity, is 960
mW. Power-aware design, however, allows power dissipation to scale down
to 30 mW during idle periods. Figure 12.27 illustrates the energy savings
effected by power-aware methodologies. Dynamic voltage scaling on the
SA-1110 processor reduces total system energy by a factor of two. Shutting
down the entire processor, as well as the radio and sensor subsystems,
further reduces energy consumption.

12.6 FUTURE DIRECTIONS

This chapter has focused on hardware and algorithmic enablers for energy-
efficient microsensor networks. The final step in the design hierarchy—the
design of an application programming interface (API) and development tools
that will bring the functionality of the network into the hands of users—is an
emerging field of research. An ideal node API would expose the power-
aware operation of the node without sacrificing the abstraction of low-level
functionality. The API would enable an application to shut down or throttle
the performance of each hardware component on the node. Top-level API
calls directed at the network as a single entity would allow quality and
performance to be set and dynamically adjusted, allowing the network to
manage global energy consumption through energy-quality tradeoffs.
Power-aware signal processing routines would be available as a library and
would be callable with ceilings on their latency or energy consumption.
Deploying software for a thousand-node microsensor network requires
advances in network simulation. A simulator and development environment
for microsensor networks must enable programmers to profile the perfor-
mance and energy-efficiency of their software and hardware over a variety
of operating conditions. Energy profiling requires the creation of accurate
energy models for the cost of computation and communication on the nodes.
Simulating thousand-node and larger networks will require new techniques
in high-speed simulation. Simulation results published in the literature rarely
exceed 100 nodes due to the limited performance of existing tools. Perhaps
the simulator could locate and cache redundant computations across nodes or
offer a trade-off between result precision and speed.
The deployment of the first successful sensor network will be an exciting
and revolutionary milestone. From there, one possible next step is a node
with an infinite lifetime. Since nodes are essentially sensing energy in the
environment, it seems reasonable to harvest it as well. A “sensor” that
efficiently transduces environmental energy into useful electrical energy is
an energy harvester. With the refinement of energy-harvesting techniques
that can gather useful energy from vibrations, spikes of radio energy, and the
like, self-powered circuitry is a very real possibility. Energy-harvesting
schemes developed in the lab have generated 10 mW of power from
mechanical vibrations— already enough for low-frequency digital signal
processing [15]. With continuing advances in energy harvesting and
improvements in node integration, a battery-free, infinite-lifetime sensor
network is possible.
Ultra-dense sensor networks are also a logical next step. As silicon
circuits continue to shrink, the physical size of the nodes themselves will
shrink as well. As node form factors shrink and researchers become
comfortable with network protocols for thousand-node networks, networks
with several hundred to several thousand nodes per square meter will begin
to appear. Such a dense network would offer a new level of parallelism and
fault-tolerance, as well as trivially small radio transmission energies.

12.7 SUMMARY

A microsensor network that can gather and transmit data for years demands
nodes that operate with remarkable energy efficiency. The properties of
VLSI hardware, such as leakage and the start-up time of radio electronics,
must be considered for their impact on system energy, especially during long
idle periods. Nodes must take advantage of operational diversity by
gracefully scaling back energy consumption, so that the node performs just
enough computation—and no more—to meet an application’s specific needs.
All levels of the communication hierarchy, from the link layer to media
access to protocols for routing and clustering, must be tuned for the
hardware and application. Careful attention to the details of energy
consumption at every point in the design process will be the key enabler for
dense, robust microsensor networks that deliver maximal system lifetime in
the most challenging and operationally diverse environments.

REFERENCES

[1] K. Bult et al., “Low power systems for wireless microsensors,” Proc. ISLPED ’96, pp.
17-21, August 1996.
[2] D. Estrin, R. Govindan, J. Heidemann, and S. Kumar, “Next century challenges: scalable
coordination in sensor networks,” Proc. ACM MobiCom'99, pp. 263-270, August 1999.
[3] G. Asada, et al., “Wireless integrated network sensors: low power systems on a chip,”
Proc. ESSCIRC '98, 1998.
[4] J. Kahn, R. Katz, and K. Pister, “Next century challenges: mobile networking for smart
dust,” Proc. ACM MobiCom '99, pp. 271-278, August 1999.
[5] N. Priyantha, A. Chakraborty, and H. Balakrishnan, “The cricket location-support sys-
tem,” Proc. MobiCom '00, pp. 32-43, August 2000.
[6] J. Rabaey et al., “PicoRadio supports ad hoc ultra-low power wireless networking,”
Computer, vol. 33, no. 7, July 2000, pp. 42-48
[7] F. Op’t Eynde et al., “A fully-integrated single-chip SOC for Bluetooth,” Proc. ISSCC
2001, Feb. 2001, pp. 196-197, 446.
[8] V. Tiwari and S. Malik, “Power analysis of embedded software: A first approach to soft-
ware power minimization,” IEEE Trans. on VLSI Systems, vol. 2, December 1994.
[9] R. Powers, “Advances and trends in primary and small secondary batteries,” IEEE Aero-
space and Electronics Systems Magazine, vol. 9, no. 4, April 1994, pp. 32-36.
[10] L. Nord and J. Haartsen, The Bluetooth Radio Specification and the Bluetooth Baseband
Specification, Bluetooth, 1999-2000.
[11] Wang, W. Heinzelman, and A. Chandrakasan, “Energy-scalable protocols for battery-
operated microsensor networks,” in Proc. IEEE SiPS '99, Oct 1999.
[12] N. Weste and K. Eshraghian, Principles of CMOS VLSI Design, A Systems Perspective,
2nd edition, Reading, Mass.: Addison-Wesley, 1993, pp. 236, 313-317.
[13] V. De and S. Borkar, “Technology and design challenges for low power and high perfor-
mance,” Proc. ISLPED '99, pp. 163-168, August 1999.
[14] Advanced RISC Machines Ltd., Advanced RISC Machines Architectural Reference
Manual, Prentice Hall, New York, 1996.
[15] R. Min, T. Furrer, and A. Chandrakasan, “Dynamic voltage scaling techniques for dis-
tributed microsensor networks,” Proc. WVLSI '00, April 2000.
[16] M. Perrott, T. Tewksbury, and C. Sodini, “27 mW CMOS Fractional-N synthesizer/mod-
ulator IC,” Proc. ISSCC 1997, pp. 366-367, February 1997.
[17] National Semiconductor Corporation, LMX3162 Evaluation Notes and Datasheet, April
1999.
[18] J. Goodman, A. Dancy, and A.P. Chandrakasan, “An energy/security scalable encryption
processor using an embedded variable voltage DC/DC Converter,” IEEE Journal of
Solid-State Circuits, Vol. 33, No. 11, November 1998.
[19] G. Wei and M. Horowitz, “A low power switching supply for self-clocked systems,”
Proc. ISLPED 1996.
[20] K. Govil, E. Chan, and H. Wasserman, “Comparing algorithms for dynamic speed-set-
ting of a low-power CPU,” Proc. MobiCom '95, August 1995.
[21] T. Pering, T. Burd, and R. Brodersen, “The simulation and evaluation of dynamic volt-
age scaling algorithms,” Proc. ISLPED '98, August 1998.
[22] M. Bhardwaj, R. Min and A. Chandrakasan, “Power-aware systems,” Proc. of 34th
Asilomar Conference on Signals, Systems, and Computers, November 2000.
[23] N. Filiol, T. Riley, C. Plett, and M. Copeland, “An agile ISM band frequency synthesizer
with built-in GMSK data modulation,” IEEE Journal of Solid-State Circuits, vol. 33, pp.
998-1008, July 1998.
[24] P. Lettieri and M. B. Srivastava, “Adaptive frame length control for improving wireless
link throughput, range, and energy efficiency,” Proc. INFOCOM '98, pp. 564-571,
March 1998.
[25] J. Proakis, Digital Communications. New York City, New York: McGraw-Hill, 4th ed.,
2000.
[26] M. Bhardwaj, “Power-aware systems,” SM Thesis, Department of EECS, Massachusetts
Institute of Technology, 2001.
[27] W. Heinzelman, A. Chandrakasan, and H. Balakrishnan, “Energy-efficient communica-
tion protocol for wireless microsensor networks,” Proc. HICSS 2000, January 2000.
[28] K. Yao, et al., “Blind beamforming on a randomly distributed sensor array system,”
IEEE Journal on Selected Topics in Communications, Vol. 16, No. 8, October 1998.
[29] S. Haykin, J. Litva, and T. Shepherd, Radar Array Processing. Springer-Verlag, 1993.
[30] A. Sinha and A. Chandrakasan, “Energy aware software,” in Proc. VLSI Design '00, pp.
50-55, Jan. 2000.
[31] Wang, S.-H. Cho, C. Sodini, A. Chandrakasan, “Energy efficient modulation and MAC
for asymmetric RF microsensor systems,” Proc. ISLPED 2001, August 2001.
Chapter 13

Circuit and System Level Power Management

Farzan Fallah¹ and Massoud Pedram²

¹ Fujitsu Labs. of America, Inc.; ² University of Southern California

Abstract: This chapter describes the concept of dynamic power management (DPM),
which is a methodology used to decrease the power consumption of a system.
In DPM, a system is dynamically reconfigured to lower the power
consumption while meeting some performance requirement. In other words,
depending on the necessary performance and the actual computation load, the
system or some of its blocks are turned off or their performance is lowered.
This chapter reviews several approaches to system-level DPM, including fixed
time-out, predictive shut-down or wake-up, and stochastic methods. In
addition, it presents the key ideas behind circuit-level power management
including clock gating, power gating and precomputation logic. The chapter
concludes with a description of several runtime mechanisms for leakage power
control in VLSI circuits.

Key words: Energy minimization, dynamic power management, policy optimization,
stochastic model of a power-managed system, controllable Markov decision
process, clock gating, power gating, pre-computation, leakage current control.

13.1 INTRODUCTION

With the rapid progress in semiconductor technology, chip density and
operation frequency have increased, making the power consumption in
battery-operated portable devices a major concern. High power consumption
reduces the battery service life. The goal of low-power design for battery-
powered devices is thus to extend the battery service life while meeting
performance requirements. Reducing power dissipation is a design goal even
for non-portable devices since excessive power dissipation results in
increased packaging and cooling costs as well as potential reliability
problems.
Portable electronic devices tend to be much more complex than a single
VLSI chip. They contain many components, ranging from digital and analog
to electro-mechanical and electro-chemical. Much of the power dissipation
in a portable electronic device comes from non-digital components.
Dynamic power management – which refers to a selective shut-off or slow-
down of system components that are idle or underutilized – has proven to be
a particularly effective technique for reducing power dissipation in such
systems. Incorporating a dynamic power management scheme in the design
of an already-complex system is a difficult process that may require many
design iterations and careful debugging and validation.
To simplify the design and validation of complex power-managed
systems, a number of standardization attempts have been initiated. Best
known among them is the Advanced Configuration and Power Interface
(ACPI) [1] that specifies an abstract and flexible interface between the
power-managed hardware components (VLSI chips, hard disk drivers,
display drivers, modems, etc.) and the power manager (the system
component that controls the turn-on and turn-off of the system components).
The functional areas covered by the ACPI specification are: system power
management, device power management, and processor power management.
ACPI does not, however, specify the power management policy.
It is assumed that the system modules (also called components) have
multiple modes of operation (also called power states) and that it is possible
to dynamically switch between these states. Obviously, for a given system
component, a state with higher power consumption will also have a
higher service speed. Furthermore, transitions between these power states
have costs in terms of energy dissipation and latency. The power-managed
system consists of a set of interacting components, some of which have
multiple power states and are responsive to commands issued by a system-
level power manager. The power manager has comprehensive knowledge of
the power states of the various components and the system workload
information. It is also equipped with means and methods to choose a power
management policy and implement it by sending state transition control
commands to the power manageable components in the system. Intuitively,
the larger the number of power states for each component, the finer the
control that the power manager can exert on the component’s operation.
For the DPM approach to apply, it is required that each system
component support at least two power states, ACTIVE and STANDBY. In
the ACTIVE state, the component performs computations or provides
services, while in the STANDBY state it does not perform any useful
computation or service; it only waits for an event or interrupt signal to wake
it up. The power consumed in the ACTIVE state is typically much higher
than that in the STANDBY state. Therefore, putting the components in the
STANDBY state when their outputs are not being used can save power.
Because the transition from one state to another consumes some energy,
there is a minimum idle time below which no power can be saved. Assume
E_AS denotes the energy consumed in the transition from the ACTIVE to the
STANDBY state, and T_AS denotes the time for this transition; E_SA and
T_SA are defined similarly. Furthermore, assume P_A and P_S denote the
power consumption values in the ACTIVE and the STANDBY states,
respectively. In a power-managed system, the component is switched from
the ACTIVE state to the STANDBY state if it has been idle for some period
of time. The minimum value of the idle time, T_min, is calculated as:

    T_min = (E_AS + E_SA - P_A (T_AS + T_SA)) / (P_A - P_S)

Figure 13.1 shows two power states of a hard disk. In the ACTIVE state
(A) the power consumption is 10 mW, while in the STANDBY state (S) the
power consumption is 0 mW. It takes one second to switch the hard disk
from the ACTIVE state to the STANDBY state. Once in the STANDBY
state, it takes two seconds to switch back to the ACTIVE state. Note that
switching between the two states consumes energy: 10 mJ for switching from
the ACTIVE state to the STANDBY state and 40 mJ when switching back to
the ACTIVE state. T_min is two seconds for this system. If the idle time is
less than T_min, switching to the STANDBY state will increase the power
consumption of the system. Otherwise, it will reduce the power
consumption. If the components of a system have a high T_min, DPM may not
be effective in reducing the power.
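The break-even calculation for this example can be sketched as follows (a Python sketch; the function and symbol names are illustrative, and the formula assumes the transition overheads given above):

```python
def min_idle_time(e_as, e_sa, t_as, t_sa, p_active, p_standby):
    """Break-even idle time T_min: shutting down saves energy only if
    the component stays idle at least this long."""
    assert p_active > p_standby
    return (e_as + e_sa - p_active * (t_as + t_sa)) / (p_active - p_standby)

# Hard-disk example from the text (units: mJ, s, mW).
t_min = min_idle_time(e_as=10.0, e_sa=40.0, t_as=1.0, t_sa=2.0,
                      p_active=10.0, p_standby=0.0)
print(t_min)  # 2.0 seconds
```

Plugging in the hard-disk numbers reproduces the two-second break-even time quoted above.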

The transition latency from the ACTIVE state to the STANDBY state
may correspond to storing register values inside the system memory to make
it possible to restore them at a later time. This requires some amount of time.

There is also a transition latency associated with transitioning from the
STANDBY to the ACTIVE state. This delay is called wake-up time. The
wake-up time of a component can have several sources:
1. It might be necessary to initialize the component and restore the
values of its registers.
2. It takes some time to stabilize the voltage on the supply lines if they
were disconnected in the STANDBY state.
3. If there is a mechanical part in a component, there will be some
delay associated with restarting it. For example, hard disk drives
store data on magnetic discs. To read or write data, the disks have to
rotate. This is done by an electric motor that consumes a large
amount of power. When in the STANDBY state, the electric motor
is turned off to save power. To transition the hard disk to the
ACTIVE state, it is necessary to turn the electric motor on and wait
until the speed of the disks is stabilized. This delay is usually much
higher than the delay corresponding to the electronic circuitry. A
similar problem occurs in floppy disk and CD-ROM drives.

Because of the non-zero wake-up delay, there is a trade-off between
power consumption and the response time (or delay) of a system. The lowest
system delay is achieved by keeping all components in the ACTIVE state at
all times, which translates to high power consumption.
DPM controls the transition of the system components between the
ACTIVE and STANDBY states and seeks to minimize the power
consumption without violating the system latency or other performance
constraints.
As mentioned before, depending on the complexity of the system
components, there may be more than one STANDBY state. For example,
IBM Travelstar hard disks have five different low power states [2]. Each of
them has its own power requirement and wake-up time. Lower power is
typically associated with longer wake-up time. Also, there may be more than
one ACTIVE state, each of which delivers some performance level while
consuming some amount of power. To make things even more complex, in
some systems, the performance may be specified as a value between an
upper and a lower bound. Thus, the power consumption changes accordingly
between a minimum and a maximum. Therefore, based on the actual
required performance, the power consumption can be lowered. This makes
power management for such a system very complex.
An underlying premise of the DPM algorithms is that systems and their
constituent components experience time-varying workloads. To achieve
significant power savings, a DPM algorithm has to be able to predict – with
a certain degree of accuracy – when each of the system components is used
and turn them on and off based on the prediction. High prediction accuracy
makes it possible to significantly reduce the power at the expense of a small
increase in the latency of the system. On the other hand, if the accuracy is
low, both the latency and the power consumption of the system might
increase.
The DPM algorithm can be implemented in software and/or hardware. In
either case, there is a power overhead associated with running the algorithm. It is very
important to take this power consumption into account while selecting an
algorithm. If the DPM algorithm is implemented in software, the load on the
core processor of the system increases. This can increase the response time
of the system. Implementing the DPM algorithm in hardware decreases the
load on the processor, but this comes at the expense of less flexibility.
DPM algorithms can be divided into two different categories: adaptive
and non-adaptive. Adaptive algorithms change the policy by which they
manage the system over time based on the change in the load of the system.
In this sense, they are capable of handling workloads that are unknown a
priori or are non-stationary. Non-adaptive algorithms use a fixed policy, that
is, they implicitly assume that, as a function of the system state, the same
decision is taken at every time instance. Adaptive algorithms tend to perform
better than non-adaptive algorithms, but they are more complex.

13.2 SYSTEM-LEVEL POWER MANAGEMENT TECHNIQUES

This section describes various algorithms used for system-level dynamic
power management (DPM).

13.2.1 Greedy Policy

A simple DPM policy may employ a greedy method, which turns the system
off as soon as it is not performing any useful task. The system is
subsequently turned on when a new service request is received. The
advantage of this method is its simplicity, but it has the following
disadvantages:
1. The policy does not consider the energy consumed by switching from
ACTIVE to STANDBY state. Therefore, it may put the system in
STANDBY even when it has been idle for a short period of time, only to
have to turn the system back to the ACTIVE state in order to provide
service to an incoming request. This can increase the overall power
consumption of the system.
2. After receiving a new service request, it often takes some time for the
system to wake up and be ready to provide the required service.
Therefore, the response time of the system increases. This increased
latency is not desirable or cannot be tolerated in many cases.

13.2.2 Fixed Time-out Policy

A simple and well-known heuristic policy is the “time-out” policy, which is
widely used in today’s portable computers. In this technique, the elapsed
time after performing the last task is measured. When the elapsed time
surpasses a threshold T_to, the system goes from the ACTIVE into the
STANDBY state. The critical decision is how to choose a value for T_to.
Large values of T_to tend to result in missing opportunities for power
savings. On the other hand, small values may increase the overall system
power consumption and the response time of the system. The fixed time-out
policy also suffers from the same shortcomings as the greedy policy.
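The energy impact of a particular time-out value can be estimated over a trace of idle intervals. The sketch below folds both transition energies into a single e_switch term and ignores transition latencies; all names and numbers are illustrative:

```python
def timeout_policy_energy(idle_times, t_out, p_active, p_standby, e_switch):
    """Energy consumed over a set of idle intervals under a fixed
    time-out policy: the component waits in ACTIVE for t_out, then (if
    the interval is longer) pays e_switch for the two transitions and
    sits in STANDBY for the remainder."""
    energy = 0.0
    for t in idle_times:
        if t <= t_out:
            energy += p_active * t            # too short: never shut down
        else:
            energy += p_active * t_out        # ACTIVE while waiting
            energy += e_switch                # down + up transition energy
            energy += p_standby * (t - t_out)
    return energy

# Illustrative trace of idle intervals (s); mW/mJ units as in the text.
trace = [0.5, 1.0, 8.0]
e = timeout_policy_energy(trace, t_out=2.0, p_active=10.0,
                          p_standby=0.0, e_switch=50.0)
print(e)  # 85.0 mJ
```

Sweeping t_out over such a trace makes the trade-off concrete: a large time-out wastes ACTIVE power before every shutdown, while a small one pays the switching energy too often.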

13.2.3 Predictive Shut-down Policy

Under a fixed time-out policy, the system is shut down as soon as the
elapsed time after performing the last task exceeds T_to. This means that the
system stays in the ACTIVE state for T_to seconds and consumes a high
amount of power without performing any useful task. To decrease the power
consumption, a predictive shut-down technique can be employed as first
proposed in [3]. In this technique, the previous history of the system is used
to decide when to go from the ACTIVE to the STANDBY state. A non-linear
regression equation is used to predict the expected idle time based on
the previous behavior of the system. If the expected idle time is long enough,
the system is turned off. Otherwise, the system remains in the ACTIVE state.
The disadvantage of this technique is that there is no method to
automatically find the regression equation.
Another predictive method measures the busy time of the system and
decides whether or not to shut the system down based on the measurement.
If the busy time is less than a threshold, the system is shut down. Otherwise,
it is left in the ACTIVE state. This method performs well for systems that
have burst-like loads. In such systems, short periods of operation are usually
followed by long periods of inactivity. Networks of sensors and wireless
terminals are two examples of such systems.
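One simple stand-in for such a predictor is an exponentially weighted average of past idle periods combined with the break-even test of Section 13.1 (a hypothetical sketch; the work in [3] fits a non-linear regression instead):

```python
def predict_idle(idle_history, weight=0.5):
    """Exponentially weighted estimate of the next idle period,
    computed from the observed idle periods (oldest first)."""
    estimate = idle_history[0]
    for t in idle_history[1:]:
        estimate = weight * t + (1.0 - weight) * estimate
    return estimate

def should_shut_down(idle_history, t_min):
    # Shut down only when the predicted idle period recoups the
    # transition overhead (t_min is the break-even idle time).
    return predict_idle(idle_history) > t_min

print(should_shut_down([4.0, 4.0, 4.0], t_min=2.0))  # True
```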
13.2.4 Predictive Wake-up Policy

The methods described so far increase the system response time. This may
not be acceptable in many cases. Hwang et al. [4] proposed the predictive
wake-up method to decrease the performance penalty. In this method, if the
time spent in the STANDBY state is more than a threshold, the system goes
to the ACTIVE state. As a result, there will be no performance penalty for
requests coming after the threshold. On the other hand, high amounts of
power are consumed while the system does not perform any tasks but is still
in the ACTIVE state.

13.2.5 Stochastic Methods

Heuristic policies cannot achieve the best power-delay tradeoff for a system.
They can account for the time varying and uncertain nature of the workload,
but have difficulty accounting for the variability and state dependency of the
service speeds and transition times of the many complex components that a
system may contain. Hence, it is desirable to develop a stochastic model of a
power-managed system and then find the optimal DPM policies under
various workload statistics.
The problem of finding a stochastic power management policy that
minimizes the total power dissipation under a performance constraint (or
alternatively, maximizes the system performance under a power constraint)
is of great interest to system designers. This problem is often referred to as
the policy optimization (PO) problem.
13.2.5.1 Modeling and Optimization Framework

An abstract model of a power-managed system consists of four components:
Service Provider (SP), Service Requestor (SR), Service Queue (SQ), and
Power Manager (PM). Figure 13.2 shows the information and command
flow in a simple power-managed system consisting of one SP and one SR.
The SR generates service requests for the SP. The SQ buffers the service
requests. The SP provides service to the requests in a first-in first-out
manner. The PM monitors the states of the SR, SQ, and SP and issues
commands to the SP.
A power management approach based on a Markov decision process was
proposed in [5]. In this work, Benini et al. adopted a stochastic model for a
rigorous mathematical formulation of the policy optimization problem and
presented a procedure for its exact solution. The solution is computed in
polynomial time by solving a linear optimization problem. More precisely,
their approach is based on a stochastic model of power-managed service
providers and service requestors and leverages stochastic optimization
techniques based on the theory of discrete-time Markov decision processes
(DTMDP). The objective function is the expected performance (which is
related to the expected wait time for a request and thus the average number
of the requests waiting in the queue) and the constraint is the expected power
consumption (which is related to the power cost of staying in some SP state
and the energy consumption for the transfer from one SP state to the next).
Non-stationary user request rates are treated by using an adaptive stochastic
policy presented in [6]. Adaptation is based on three steps: policy
precharacterization, parameter learning, and policy interpolation. A
limitation of both of these policies is that decision-making is performed at
fixed time intervals, even when the system is inactive, thus wasting power.
In [7], Qiu et al. model the power-managed system using a controllable
Markov decision process with cost. The system model is based on
continuous-time Markov decision processes (CTMDP), which are more
suitable for modeling real systems. The resulting power management policy
is asynchronous, which is more appropriate for implementation as part of the
operating system. The overall system model is constructed exactly and
efficiently from the component models. The authors use mathematical
calculations based on the tensor product and sum of generator matrices for
the SP, SQ and SR to derive the generator matrix for the power-managed
system as a whole.12 Both linear and nonlinear programming techniques are
used to obtain exact, randomized, and deterministic DPM policies. A greedy
algorithm known as policy iteration is also presented for efficiently finding a
heuristic stochastic policy.

12 The generator matrix for a CTMDP is equivalent to the transition matrix for a DTMDP.
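For components whose behaviors are independent, the joint generator can be assembled with the Kronecker (tensor) sum. The sketch below is a generic NumPy illustration of that composition, not the exact construction of [7]; the matrices and rates are made up:

```python
import numpy as np

def kron_sum(q1, q2):
    """Generator of two independent CTMCs running in parallel:
    Q = Q1 (+) Q2 = Q1 x I + I x Q2 (Kronecker sum)."""
    n1, n2 = q1.shape[0], q2.shape[0]
    return np.kron(q1, np.eye(n2)) + np.kron(np.eye(n1), q2)

# Illustrative 2-state generators (each row sums to zero): an SR that
# toggles between low/high request-rate modes, and a 2-state SP.
q_sr = np.array([[-0.2,  0.2],
                 [ 0.5, -0.5]])
q_sp = np.array([[-1.0,  1.0],
                 [ 2.0, -2.0]])
q_sys = kron_sum(q_sr, q_sp)   # 4x4 generator of the joint process
```

The result is again a valid generator: each row sums to zero and all off-diagonal rates are non-negative.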
A potential shortcoming of these stochastic power management
techniques is that they tend to make assumptions about the distribution of
various events in the systems. For example, they assume that the request
inter-arrival time and the service time follow an exponential distribution.
While in some cases, these assumptions are valid, in general, they can result
in significant modeling errors. Simunic et al. in [8] proposed the time-
indexed semi-Markov decision processes that enable the modeling of system
state transitions characterized by general distributions. Similarly, Wu et al. in
[9] show how to overcome the modeling restriction on inter-arrival time of
the requests by using a “stage method,” which approximates the general
distributions using series or parallel combinations of exponential stages.
In situations where complex system behaviors, such as concurrency,
synchronization, mutual exclusion, and conflict are present, the
abovementioned modeling techniques become inadequate because they are
effective only when constructing stochastic models of simple systems. In [9]
and [10], techniques based on controllable generalized stochastic Petri nets
(GSPN) with cost are proposed that are powerful enough to compactly
model a power-managed system with complex behavioral characteristics
under quality of service constraints. It is indeed easier for the system
designer to manually specify the GSPN model than to provide a CTMDP
model. Given the GSPN model, it is then simple to automatically construct
an equivalent (but much larger) CTMDP model that is subsequently used to
solve the policy optimization problem.

13.2.5.2 A Detailed Example

A simple example is included next to explain the details of the CTMDP-
based modeling framework. Figure 13.3 shows a sample Markov process
model of a SP with six states, denoted generically by s. In particular, the SP
has four active states (Busy1, Busy2, Idle1 and Idle2), a standby state
(Standby), and a deep sleep state (Sleep). Busy1 and Busy2 states are
different only in terms of the corresponding service speeds, and power
consumptions, pwr(s). The only difference between Busy1 and Idle1 states
is a functional one, in the sense that the SP is providing service in Busy1 (or
Busy2), whereas it is sitting idle and doing nothing in Idle1 (or Idle2).
Transitions between the busy and idle states are autonomous (i.e., they are
not controllable by the SP). When the SP finishes servicing a request, it will
automatically switch from the busy state to its corresponding idle state. The
two inactive states (Standby and Sleep) are different from one another in
their power consumptions (the service speed of the SP is zero in both states)
and also in the amount of required energy and the latency for transferring to
the active states.

It is assumed that the service completion events in a given state s of the
SP form a stochastic process that follows a Poisson distribution.
Consequently, the service times in state s follow an exponential distribution
with a mean value of 1/mu(s). In other words, when the SP is in state s, it
needs on average 1/mu(s) time to provide service to an incoming request.
Notice that compared to the Standby state, the Sleep state has a lower power
consumption rate but results in larger energy dissipation and longer latency
to come out of it. Transitions among Idle1, Idle2, Standby and Sleep are all
controllable transitions, in the sense that the probability of making a
transition between two of these states is a function of the policy employed by
the PM. Assuming that the time needed for the SP to switch from state s to
state s' also follows an exponential distribution with a mean value of t(s, s'),
the state transition rates are given by sig(s, s') = 1/t(s, s'). The energy
required for this same transition is given by ene(s, s').
Given a set of all possible actions, A, it is possible to calculate the
transition rates of the SP as follows:

    sig_a(s, s') = delta_a(s, s') * sig(s, s')

where delta_a(s, s') is one if s' is the destination state under an action a;
otherwise it is set to zero.
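This rule can be sketched in a few lines; the state names, base rates, and action encoding below are illustrative:

```python
# Controlled transition rates of the SP: a transition s -> s' is enabled
# only when s' is the destination commanded by action a.
BASE_RATE = {                    # sig(s, s') = 1 / mean switching time
    ("Idle1", "Standby"): 0.5,
    ("Idle1", "Sleep"):   0.2,
    ("Standby", "Idle1"): 1.0,
}

def controlled_rate(s, s_next, action):
    """sig_a(s, s') = delta_a(s, s') * sig(s, s')."""
    delta = 1.0 if action == ("goto", s_next) else 0.0
    return delta * BASE_RATE.get((s, s_next), 0.0)

print(controlled_rate("Idle1", "Sleep", ("goto", "Sleep")))  # 0.2
```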
System-level Power Management Techniques 383

Figure 13.4 shows the Markov process model of an SR with two states,
r0 and r1. When the SR is in state r0, it generates a request every 1/lam0 ms
on average. Similarly, when it is in state r1, it generates a request every
1/lam1 ms. So, assuming that the request inter-arrival time follows an
exponential distribution, the request generation rates in the two states are
lam0 and lam1, respectively. Furthermore, assume that the time needed for
the SR to switch from one operation state to another is a random variable
with exponential distribution. In particular, when the SR is in state r0, the
expected time for it to switch to state r1 is t01 ms (i.e., its transition rate is
1/t01), and when the SR is in state r1, the expected time for it to switch to
state r0 is t10 ms (i.e., its transition rate is 1/t10).

Figure 13.5 shows the Markov process model of an SQ of length 3. The
model has four states, each state describing the number of waiting requests
in the SQ (i.e., from zero to three pending requests). Transitions from state
qi to qi-1 occur at the service rate of the SP, whereas transitions from state
qi to qi+1 occur at the generation rate of the SR. If the SQ is full
and a new request arrives, the request will be rejected. The assumption here
is that the request generation rate of the SR is lower than the service rate of
the SP. This condition needs to be enforced and must be met by the system
designer (unless “lossy” service can be tolerated).
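The SQ model is a birth-death chain whose generator can be sketched as follows (rates are illustrative; an arrival to a full queue is simply dropped):

```python
def queue_generator(capacity, arrival_rate, service_rate):
    """Generator matrix of the SQ model: the state is the number of
    waiting requests; an arrival to a full queue is rejected."""
    n = capacity + 1
    q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        if i < capacity:
            q[i][i + 1] = arrival_rate    # SR generates a request
        if i > 0:
            q[i][i - 1] = service_rate    # SP completes a request
        q[i][i] = -sum(q[i])              # generator rows sum to zero
    return q

q_sq = queue_generator(capacity=3, arrival_rate=0.4, service_rate=1.0)
```

In the last row (queue full), the only outgoing transition is a service completion, reflecting the rejection of new requests.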
The Power-Managed System (SYS) can subsequently be modeled as a
controllable CTMDP, which is obtained as a composition of the models of
the SP, SR, and SQ. A SYS state is a tuple (s, r, q). Obviously, the invalid
states where the SP is busy and the SQ is empty are excluded from the set of
system states. A set of all possible actions, A, is also given. Therefore, it is
possible to derive the transition rates between pairs of global SYS states
(s, r, q) and (s', r', q') under different actions, a.
A policy is simply the set of state-action pairs for all the states of the
SYS. A policy can be either deterministic or randomized. If the policy is
deterministic, then when the system is in state i at time t, an action a is
chosen with probability 1. If the policy is randomized, then when the system
is in state i at time t, an action a is chosen with probability p(i, a), such that
the action probabilities for each state sum to one. Based on this formalism, it
is straightforward to write a mathematical
program to find a policy that minimizes the expected energy dissipation of
the SYS (more precisely, the limiting average energy cost) under a constraint
on the average wait time in the SQ (see [7] for details.) Depending on how
the problem is solved, either an optimal deterministic or randomized policy
may be obtained. Notice that the policy obtained in this way is stationary
in the sense that its functional dependency on the state of the SYS does not
change over time. Obviously, this does not exclude randomized policies;
instead, it merely states that the PM uses a fixed (non-adaptive) policy based
on a priori characterization of the system workload.
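A stationary randomized policy can be represented as a table from joint states to action distributions, sampled at each decision point; the states, actions, and probabilities below are purely illustrative:

```python
import random

# A stationary randomized policy: each joint SYS state maps to a
# distribution over actions.
POLICY = {
    ("Idle1", "low", 0):  [("go_sleep", 0.7), ("stay", 0.3)],
    ("Idle1", "high", 0): [("stay", 1.0)],     # a deterministic entry
}

def choose_action(state, rng=random.random):
    """Sample an action according to the policy's distribution."""
    u, acc = rng(), 0.0
    for action, prob in POLICY[state]:
        acc += prob
        if u < acc:
            return action
    return POLICY[state][-1][0]
```

A deterministic policy is the special case in which every entry assigns probability 1 to a single action.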
In practice, a power-managed system may work in the following way.
When the SP changes states, it sends an interrupt signal to the PM. The PM
then reads the states of all components in the SYS (and hence obtains the
joint system state) and issues a command according to the chosen policy.
The SP receives the command and switches to the state dictated by the
command.
Notice that in the above example the discrete-time model would not be
able to distinguish the busy states and the idle states because the transitions
between these two sets of states are instantaneous. However, the transition
probabilities of the SP when it is in these two sets of states are different, and
therefore, such a distinction is essential in constructing an accurate stochastic
model of the power-managed system. Moreover, with a discrete-time model,
the power management program would have to send control signals to the
components in every time-slice, which results in heavy signal traffic and a
heavy load on the system resources (and therefore higher power
consumption overhead).
13.2.5.3 Adaptive Power Control

A typical stochastic power management policy optimization flow is as
follows:
1. Build stochastic models of the service requesters (SRs) and service
providers (SPs).
2. Construct a complete CTMDP-based model of the power-managed
system.
3. Obtain values of model parameters by assuming a known workload.
4. Solve the resulting policy optimization problem and store the optimal
policy.
5. If the workload is encountered at runtime, then employ the
corresponding policy.
As stated before, a policy derived based on this framework is stationary.
To make the PM responsive to variations in the workload characteristics
(that is, develop an adaptive policy), one can use a policy decision tree. In
this tree, a non-leaf node represents a decision point. For example, it
captures the type of the application running on the system, real-time
criticality of the performance constraints, stochastic characteristics of the
workload (e.g., in terms of simple or exponentially weighted moving
averages of request inter-arrival times), privacy or security requirements, etc.
The unique path from the root of the tree to any leaf node identifies specific
workload characteristics and system conditions/constraints. Each leaf node
in turn stores a fixed policy for the corresponding path.
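A two-level lookup of this kind can be sketched as a nested table; the attribute names, thresholds, and policy names below are invented for illustration.

```python
# Illustrative policy decision tree: non-leaf decision points are dict
# levels, leaves store a pre-computed fixed policy (here just its name).
decision_tree = {
    "mpeg_playback": {
        "short_interarrival": "aggressive_active_policy",
        "long_interarrival":  "moderate_sleep_policy",
    },
    "text_editing": {
        "short_interarrival": "moderate_sleep_policy",
        "long_interarrival":  "deep_sleep_policy",
    },
}

def select_policy(app_type, ewma_interarrival_ms, threshold_ms=50.0):
    # An observed workload statistic (e.g., an exponentially weighted
    # moving average of request inter-arrival times) picks the branch.
    bucket = ("short_interarrival" if ewma_interarrival_ms < threshold_ms
              else "long_interarrival")
    return decision_tree[app_type][bucket]

print(select_policy("mpeg_playback", 12.0))   # aggressive_active_policy
print(select_policy("text_editing", 400.0))   # deep_sleep_policy
```

The path from root to leaf (application type, then workload bucket) identifies the conditions; the leaf returns the pre-computed fixed policy for those conditions.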
For this two-level power management scheme to be effective, it is
essential that the runtime environment - for example, an operating system
kernel or a hardware controller monitoring (“observing”) the system and its
environment - can collect information and compile (“learn”) parameter
values needed to correctly and efficiently characterize the workload and
system conditions in order to select and “implement” a pre-computed fixed
policy. Care must be taken to ensure that the power consumption of the
“observer” is negligible and its operation is transparent, that the “learner”
can identify new workload and system conditions quickly to avoid
employing a wrong policy for the actual conditions, and that the
“implementer” can execute the required policy efficiently without disturbing
the overall system performance. Finally, if the estimated parameter values do
not exactly match values stored on any path in the policy decision tree, then
a new policy must be constructed on the fly as a hybrid of the policies whose
workloads most closely match that of the actual workload. This is a non-
trivial construction and requires further research.
386 Circuit and System Level Power Management

13.2.5.4 Battery-aware Power Management

What is missing in the above-mentioned stochastic DPM approach is knowledge about the characteristics and performance of battery sources that
power the system. As demonstrated by research results in [11], the total
energy capacity that a battery can deliver during its lifetime is strongly
related to the discharge current rate. More precisely, as the discharge current
increases, the deliverable capacity of the battery decreases. This
phenomenon is called the (current) rate-capacity effect. Another important
property of batteries, which was analyzed and modeled in [12], is named the
relaxation phenomenon or recovery effect. It is caused by the concentration
gradient of active materials in the electrode and electrolyte formed in the
discharge process. Driven by the concentration gradient, the active material
at the electrolyte-electrode interface, which is consumed by the
electrochemical reactions during discharge, is replenished with new active
materials through diffusion. Thus the battery capacity is somewhat recovered
during a no-use state. Due to these non-linear characteristics, a minimum power consumption policy does not necessarily result in the longest battery service life, because the energy capacity of its power sources may not be fully exploited when the cut-off voltage of the battery is reached.
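The rate-capacity effect can be illustrated numerically with Peukert's empirical law; this toy model is not the battery model used in [11]-[13], and the constants below are illustrative only.

```python
def deliverable_capacity(rated_capacity_ah, rated_current_a, current_a, k=1.2):
    """Toy rate-capacity model based on Peukert's empirical law: the
    effective capacity shrinks as the discharge current rises above the
    rated current. k is an empirical exponent (k = 1 means an ideal
    battery with no rate-capacity effect)."""
    return rated_capacity_ah * (rated_current_a / current_a) ** (k - 1.0)

cap_rated = deliverable_capacity(1.0, 0.1, 0.1)  # discharge at rated current
cap_fast  = deliverable_capacity(1.0, 0.1, 1.0)  # 10x the rated current
print(cap_rated, cap_fast)  # higher current -> lower deliverable capacity
```

With these numbers, discharging at ten times the rated current shrinks the deliverable capacity by roughly a third, which is why a policy that minimizes average power alone need not maximize battery service life.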
In [13] Rong et al. present a CTMDP-based model of a power-managed
battery-powered portable system with an integrated model of its battery
power source. The battery model correctly captures the two important
battery characteristics, i.e., the current-capacity effect and the recovery
effect. Furthermore, the authors consider the case of a dual-battery power
source with a battery switch that is controlled by the PM. The system model
is similar to that in Figure 13.1 but also contains two batteries B1 and B2
and a battery switch SW. The SP is powered by these two batteries, which
have different current-capacity and recovery characteristics. The SW selects
either B1 or B2 to provide power to the SP at any given time. Note that only
one of the batteries is used at a given time and the other is resting at that
time. Based on this model, it is shown that an optimal policy that would
maximize the service life (i.e., time from full charge capacity to complete
discharge of the battery power source) of the battery-powered system can be
obtained.

13.3 COMPONENT-LEVEL POWER MANAGEMENT TECHNIQUES

Based on the source of the power that they reduce, component-level power
management techniques can be divided into the following categories:
Component-level Power Management Techniques 387

1. Dynamic power minimization techniques
2. Leakage power minimization techniques

The first category includes methods that reduce the power dissipation of a system when it is working, whereas the second set of methods decreases the power of the system when it is in a sleep state.
It is also possible to put the techniques in three categories based on the
method they use to decrease the power consumption. These categories are:
1. Techniques that detect idleness in a system component and decrease
the power consumption by disabling that component. Examples
include clock gating and power gating.
2. Techniques that detect a change in the computational workload of
the system component and reduce the power supply voltage and
lower the clock frequency to decrease the power consumption.
Examples include dynamic voltage and frequency scaling and
dynamic threshold voltage scaling.
3. Techniques that detect a special property in the inputs of one or
more components and use it to decrease the power consumption.
Examples include use of precomputation logic and application of a
minimum leakage vector to the inputs of the circuit.

13.3.1 Dynamic Power Minimization

Because dynamic power has been the dominant source of the power
dissipation in VLSI circuits and systems to date, a significant effort has been
expended on decreasing it. Dynamic power is consumed every time the
output of a gate is changed and its average value can be computed using the
following formula (assuming that all transitions are full rail-to-rail
transitions):

    P_avg = α · C · V² · f

where C is the capacitive load of the gate, V is the supply voltage, f is the clock frequency, and α is the switching activity. To decrease the dynamic power, any of the parameters in the formula, namely the capacitance, supply voltage, frequency, and switching activity, may be reduced. In the next subsections several techniques are introduced that decrease the dynamic power consumption by decreasing one or more parameters in the above formula.
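As a quick numerical illustration of the formula (the parameter values below are arbitrary, not taken from any particular process):

```python
def dynamic_power(alpha, c_farads, v_volts, f_hertz):
    """Average dynamic power of a gate: P = alpha * C * V^2 * f."""
    return alpha * c_farads * v_volts ** 2 * f_hertz

# 10 fF load, 1.2 V supply, 500 MHz clock, switching activity 0.1:
p = dynamic_power(0.1, 10e-15, 1.2, 500e6)
print(p)  # 7.2e-07 W, i.e. 0.72 uW

# Halving V gives a 4x reduction; the other parameters enter linearly.
assert abs(dynamic_power(0.1, 10e-15, 0.6, 500e6) - p / 4) < 1e-18
```

The quadratic dependence on V is why supply voltage scaling (Section 13.3.1.2) is the most rewarding knob among the four.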

13.3.1.1 Clock Gating

A simple and effective method for decreasing the dynamic power consumption is disabling the clock of a system component whenever its
output values are not used. If none of the components of a system are used,
the clock of the entire system can be turned off, and the system can go to the
STANDBY state. Clock gating decreases the power by decreasing the
switching activity in the
1. Flip-flops of a circuit
2. Gates in the fanout of the flip-flops
3. Clock tree

Figure 13.6 illustrates how clock gating can be used to decrease the
switching activity in a circuit. If the enable signal is one, the circuit works as usual. If the enable signal is zero, the value of flip-flop Q remains unchanged. Thus, there will be no switching activity in the flip-flop or the
data path (alternatively, the controller module) in its fanout. To further
decrease the power consumption, if the value of D remains unchanged, the
clock is disabled even when the outputs of the data path or the controller are
used. This eliminates the switching activity in the internal gates of the flip-
flop.
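The effect can be illustrated with a toy cycle-by-cycle simulation that counts output toggles and delivered clock pulses; the data and enable traces below are invented.

```python
def register_activity(data, enable, gated):
    """Count flip-flop output transitions and clock pulses delivered to
    the register for a given trace. With clock gating, disabled cycles
    see no clock pulse at all; without it, the clock toggles every cycle
    and a mux recirculates Q when the register is disabled."""
    q, q_toggles, clock_pulses = 0, 0, 0
    for d, en in zip(data, enable):
        if gated and not en:
            continue                 # gated clock: no pulse, Q holds
        clock_pulses += 1            # ungated (or enabled) cycle
        new_q = d if en else q       # recirculating mux when disabled
        if new_q != q:
            q_toggles += 1
        q = new_q
    return q_toggles, clock_pulses

data   = [1, 0, 1, 1, 0, 1, 0, 0]
enable = [1, 1, 0, 0, 0, 1, 1, 0]
print(register_activity(data, enable, gated=False))  # pulses every cycle
print(register_activity(data, enable, gated=True))   # pulses only when enabled
```

The output values are identical in both cases; only the internal clock activity differs, which is exactly the saving clock gating targets.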
Figure 13.7 shows an implementation of the clock gating circuitry using
an OR gate and its corresponding waveforms. Note that the flip-flop is a
positive edge-triggered flip-flop. In order for the circuit to work correctly,
GCLK has to be glitch-free. This can be achieved by stabilizing Enable
within the first half of the clock cycle if the Enable signal is glitch-free.

Figure 13.8 shows another implementation of the clock gating circuitry


using an AND gate. In this case, even when the Enable signal stabilizes
within the first half of the clock cycle, GCLK may have some glitches. To
guarantee the correct functionality of the circuit, GCLK can be inverted. In
other words, a NAND gate can be used instead of the AND gate. As another
option, a negative edge-triggered flip-flop may be used.
It is also possible to use a latch to generate a glitch-free GCLK even
when there is a ZERO or ONE hazard on the Enable line. This is depicted in
Figure 13.9. Note that even when Enable has some glitches, LEN and GCLK
are glitch-free. Therefore, using a latch is the preferred method.
As shown above, clock gating can be used to decrease the switching
activity inside a block or a flip-flop. Another possibility is to use clock
gating to disable the inverters/buffers that may be present in a clock tree.
This is especially important because of the significant amount of dynamic
power consumed in highly capacitive clock distribution trees.

In Figure 13.10 making Enable equal to zero freezes the clock signal. As
a result, the switching activity on all clock drivers and modules in its fanout
is eliminated.
Up to this point it has been shown how clock gating can be done at the
gate level. It is also possible to perform clock gating in the Hardware
Description Language (HDL) specification of the circuit. In the remainder of
this section several methods for clock gating at the HDL level are described,
namely, register substitution and code separation.
Register substitution replaces registers that have enable signals with
gated-clock registers. Figure 13.11 shows part of a Verilog description and
its gated-clock version. In the gated-clock version, an always statement has
been used to generate a synchronized glitch-free enable signal (i.e., l_ena).

The register is clocked using the AND of the original clock and the
generated enable signal.

In the code separation method proposed by Ravaghan et al. [14], parts of the
Verilog code that are conditionally executed are identified and separated.
Then, clock gating is used for each part.

Figure 13.12 shows a part of a second Verilog description and its gated-
clock version. In the original description, the first and the second statements
inside the always block are executed at each positive clock edge, while the
last statement is executed conditionally. Thus, the last statement may be
separated from the rest and can be transformed using a clock gating
technique.

It is noteworthy that the clock gating transformation can decrease the circuit area. Figure 13.13(a) shows a 32-bit register whose value is
conditionally updated. Figure 13.13(b) shows the same register after clock
gating transformation. As one can see above, a small circuit that generates
the gated clock has replaced the 32-bit multiplexer present in the original
circuit. This results in a significant reduction in the area of the circuit.

13.3.1.2 Dynamic Voltage and Frequency Scaling

Dynamic Voltage and Frequency Scaling (DVFS) is a highly effective method to minimize the energy dissipation (and thus maximize the battery
service time in battery-powered portable computing and communication
devices) without any appreciable degradation in the quality of service (QoS).
The key idea behind DVFS techniques is to vary the voltage supply and the
clock frequency of the system so as to provide “just enough” circuit speed to
process the workload while meeting the total computation time and/or
throughput constraints and thereby reduce the energy dissipation.

DVFS techniques can be divided into two categories, one for non real-
time operation and the other for real-time operation. The most important step
in implementing DVFS is prediction of the future workload, which allows
one to choose the minimum required voltage/frequency levels while
satisfying key constraints on energy and QoS. As proposed in [15] and [16],
a simple interval-based scheduling algorithm can be used in non real-time
operation. This is because there is no hard timing-constraint. As a result,
some performance degradation due to workload misprediction is allowed.
The defining characteristic of the interval-based scheduling algorithm is that
uniform-length intervals are used to monitor the system utilization in the
previous intervals and thereby set the voltage level for the next interval by
extrapolation. This algorithm is effective for applications with predictable
computational workloads such as audio or other digital signal processing
intensive applications [17]. Although the interval-based scheduling
algorithm is simple and easy to implement, it often predicts the future
workload incorrectly when a task’s workload exhibits a large variability.
One typical example of such a task is MPEG decoding. In MPEG decoding,
because the computational workload varies greatly depending on each frame
type, frequent load mispredictions may result in a decrease in the frame rate,
which in turn means a lower QoS in MPEG.
There are also many ways to apply DVFS in real-time application
scenarios. In general, some information is given by the application itself, and
the OS can use this information to implement an effective DVFS technique.
In [18], an intra-task voltage scheduling technique was proposed in which
the application code is split into many segments and the worst-case
execution time of each segment (which is obtained by static timing analysis)
is used to find a suitable voltage for the next segment. A method using a
software feedback loop was proposed in [19]. In this scheme, a deadline for
each time slot is provided. Furthermore, the actual execution time of each
slot is usually shorter than the given deadline, which means that a slack time
exists. The authors calculated the operating frequency of the processor for
the next time slot depending on the slack time generated in the current slot
and the worst-case execution time of each slot.

In both cases, real-time or non real-time, prediction of the future workload is quite important. This prediction is also the most difficult step in
devising and implementing an effective DVFS technique, especially when
the workload varies dramatically from one time instance to the next.
Figure 13.14 shows a typical architecture used in DVFS. A DC-DC
converter generates the supply voltage while the clock is generated using a
Voltage Controlled Oscillator (VCO) [20]. Supply voltage is selected based
on the required throughput or performance (i.e., if the computational load is
high, a high voltage is selected. Otherwise, the minimum voltage that
satisfies the performance requirement is chosen.)
Instead of using a supply voltage generator that can produce any voltage between a minimum and a maximum, it is possible to use a quantized supply voltage to reduce the design overhead [17].
Decreasing the supply voltage increases the delay of the circuit.
Therefore, it will be necessary to decrease the clock frequency. This is done
with a VCO. This method has been successfully used in many systems to
decrease the power consumption (see [18], [21], and [22]).
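With a quantized supply, voltage selection reduces to a table lookup; the (voltage, frequency) pairs below are illustrative placeholders, not values from [17] or [20].

```python
# Hypothetical quantized operating points (voltage in V, frequency in
# MHz), sorted by increasing performance; values are illustrative only.
OPERATING_POINTS = [(0.8, 100), (1.0, 200), (1.2, 400), (1.4, 600)]

def pick_operating_point(required_mhz):
    """Return the lowest-voltage point whose frequency meets the
    performance demand, falling back to the fastest point when the
    demand exceeds all of them."""
    for v, f in OPERATING_POINTS:
        if f >= required_mhz:
            return v, f
    return OPERATING_POINTS[-1]

print(pick_operating_point(150))   # (1.0, 200)
print(pick_operating_point(1000))  # (1.4, 600): demand capped at maximum
```

Choosing the lowest point that still meets the demand is exactly the "just enough circuit speed" rule stated earlier.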
Figure 13.15 illustrates the basic concept of DVFS for real-time application scenarios. In this figure, T1 and T2 denote deadlines for tasks τ1 and τ2, respectively (in practice, these deadlines are related to the QoS requirements.) Task τ1 finishes at T1/2 if the CPU is operated with a supply voltage level of Vdd. The CPU will be idle during the remaining (slack) time, [T1/2, T1]. To provide a precise quantitative example, assume that T2 = 2·T1, that the CPU clock frequency at Vdd is f = 2n/T1 for some integer n, and that the CPU is powered down or put into standby with zero power dissipation during the slack time. The total energy consumption of the CPU is then E = C·Vdd²·n, where C is the effective switched capacitance of the CPU per clock cycle. Alternatively, τ1 may be executed on the CPU by using a voltage level of Vdd/2 and is thereby completed at T1. Assuming a first-order linear relationship between the supply voltage level and the CPU clock frequency, the clock frequency at Vdd/2 is f/2. In the second case, the total energy consumed by the CPU is C·(Vdd/2)²·n = E/4. Clearly, there is a 75% energy saving as a result of lowering the supply voltage. This saving is achieved in spite of “perfect” (i.e., immediate and with no overhead) power down of the CPU in the first case. This energy saving is achieved without sacrificing the QoS because the given deadline is met. An energy saving of 89% is achieved when scaling Vdd to Vdd/3 and f to f/3 in the case of task τ2.

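The arithmetic behind these savings is easy to check directly; the capacitance, supply voltage, and cycle count below are placeholder values, since only the ratios matter.

```python
def energy(c, vdd, cycles):
    """Switching energy of a task: E = C * Vdd^2 * (number of cycles)."""
    return c * vdd ** 2 * cycles

C, VDD, N = 1e-9, 1.8, 1_000_000          # illustrative values
e_full  = energy(C, VDD, N)               # run at Vdd, then idle
e_half  = energy(C, VDD / 2, N)           # stretch the task to deadline T1
e_third = energy(C, VDD / 3, N)           # stretch the task to deadline T2

print(1 - e_half / e_full)    # 0.75 -> 75% saving
print(1 - e_third / e_full)   # ~0.889 -> 89% saving
```

Because the cycle count is fixed by the task, energy scales purely with Vdd², giving 1/4 and 1/9 of the original energy at Vdd/2 and Vdd/3, respectively.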
A major requirement for implementation of an effective DVFS technique is accurate prediction of the time-varying CPU workload for a given
computational task. A simple interval-based scheduling algorithm is
employed in [23] to dynamically monitor the global CPU workload and
adjust the operating voltage/frequency based on a CPU utilization factor, i.e.,
decrease (increase) the voltage when the CPU utilization is low (high). Two
prediction schemes have been used in interval-based scheduling: the
moving-average (MA) and the weighted-average (WA) schemes. In the MA
scheme, the next workload is predicted based on the average value of
workloads during a predefined number of previous intervals, called window
size. In the WA scheme, a weighting factor is considered in calculating the
future workload such that severe fluctuation of the workload is filtered out,
resulting in a smaller average prediction error.
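The two predictors can be sketched as follows; the window size, weighting factor, and utilization trace are illustrative, not taken from [23].

```python
def moving_average(history, window=4):
    """MA scheme: predict the next interval's workload as the mean of
    the last `window` observations."""
    recent = history[-window:]
    return sum(recent) / len(recent)

def weighted_average(history, alpha=0.5):
    """WA scheme: exponentially weighted prediction. Recent intervals
    count more, and a single-interval spike is partially filtered out."""
    pred = history[0]
    for x in history[1:]:
        pred = alpha * x + (1 - alpha) * pred
    return pred

load = [0.30, 0.35, 0.90, 0.32]   # one spike in an otherwise steady load
print(moving_average(load))       # every sample weighted equally
print(weighted_average(load))     # the spike carries only a quarter weight
```

With alpha = 0.5, the spike two intervals back contributes weight alpha·(1−alpha) = 0.25, so the WA prediction reacts less to transient fluctuations than the plain mean.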
In [24], Choi et al. present a DVFS technique for MPEG decoding to
reduce the energy consumption while maintaining a quality of service (QoS)
constraint. The computational workload for an incoming frame is predicted
using a frame-based history so that the processor voltage and frequency can
be scaled to provide the exact amount of computing power needed to decode
the frame. More precisely, the required decoding time for each frame is
separated into two parts: a frame-dependent (FD) part and a frame-
independent (FI) part. The FD part varies greatly according to the type of the
incoming frame whereas the FI part remains constant regardless of the frame
type. In the proposed DVFS scheme, the FI part is used as a “buffer zone” to
compensate for the prediction error that may occur during the FD part. This
scheme allows the authors to obtain a significant energy saving without any
notable QoS degradation.
Although the DVFS method is currently a very effective way to reduce
the dynamic power, it is expected to become less effective as the process
technology scales down. The current trend of lowering the supply voltage in
each generation decreases the leeway available for changing the supply
voltage. Another problem is that the delay of the circuit becomes a sub-linear
function of the voltage for small supply voltages. Hence, the actual power
saving becomes sub-quadratic.

13.3.1.3 Pre-computation

The previous sections presented several methods for decreasing the switching activity of a circuit. The abovementioned methods took advantage of the fact that the outputs of a circuit were not used (clock gating), were equal to their previous values (clock gating), or the performance requirement was low (DVFS) to decrease the dynamic power.
This section introduces a method that takes advantage of the fact that in
many cases there are some input values for which the output of the circuit
can be computed easily. In other words, in the new method the values of the
inputs are checked, and for some values the outputs are computed using a
smaller circuit rather than the main circuit. This provides an opportunity for
reducing dynamic power in the main circuit.

Figure 13.16 shows a circuit f and its pre-computed implementation. In the pre-computed circuit, the values of the inputs are checked by circuit g.
For some values, the flip-flops are not enabled. As a result, the switching
activity in circuit f is eliminated. The values of the outputs in these cases are
computed using circuit g, and they appear in the outputs using a multiplexer.
Note that despite the large overhead of this transformation, i.e., adding a
new combinational circuit, multiplexers, and flip-flops, in practice in many
cases it is possible to decrease the power consumption. The amount of power
saving depends on the complexity of circuit g and the likelihood of
observing some values in the input whose corresponding output value can be
computed by circuit g.
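As a software-level sketch of the idea, consider a magnitude comparator, a standard example in the precomputation literature; the bit width and function names here are ours.

```python
def full_compare(a, b):
    """The 'main circuit' f: a full n-bit magnitude comparison."""
    return a > b

def precomputed_compare(a, b, n=8):
    """Predictor circuit g looks only at the most significant bits.
    When they differ, the answer is already known and the main circuit
    need not be evaluated (in hardware, its input flip-flops would
    simply not be enabled)."""
    msb_a, msb_b = (a >> (n - 1)) & 1, (b >> (n - 1)) & 1
    if msb_a != msb_b:
        return msb_a > msb_b, "g"     # resolved by the small circuit
    return full_compare(a, b), "f"    # fall back to the main circuit

print(precomputed_compare(200, 15))   # (True, 'g'): MSBs differ
print(precomputed_compare(40, 41))    # (False, 'f'): full compare needed
```

For uniformly random inputs the MSBs differ half the time, so the large comparator is idle in roughly half the cycles, which is where the dynamic power saving comes from.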
Due to adding a multiplexer to the circuit, the delay of the pre-computed
circuit is more than the delay of the original one. Also, driving the latch
enable signal of flip-flops using circuit g increases the delay of the circuit in
the fanin of the inputs of the flip-flops.
Figure 13.16 shows one architecture for pre-computation. Many other
architectures are possible (see [25]). For example, to decrease the overhead
of pre-computation, it is possible to only control the updating of the inputs.

Figure 13.17 shows such an architecture. Suppose that for a value of one subset of the inputs, x1, the output value is independent of the value of the remaining inputs, x2. This means that it will not be necessary to update the value of x2. Keeping the value of x2 unchanged eliminates some switching activities in the circuit. The area overhead of this new architecture is less than the previous one. Also, there is no multiplexer in the output. Hence, the delay of the circuit remains unchanged. However, the delay of the circuitry driving the flip-flop enable signals still increases.
Pre-computation is a simple and general technique. Therefore, it can be
used to decrease the power consumption of any circuit, but the saving comes
at the price of increasing the delay and the area of the circuit.

13.3.2 Leakage Power Minimization

The current trend of lowering the supply voltage with each new technology
generation has helped reduce the dynamic power consumption of CMOS
logic gates. Supply voltage scaling increases the gate delays unless the threshold voltage of the transistors is also scaled down. The unfortunate effect of decreasing the threshold voltage is a significant increase in the leakage current of the transistors.
There are three main sources for leakage current:
1. Source/drain junction leakage current (I1 in Figure 13.18)
2. Gate direct tunneling leakage (I2 in Figure 13.18)
3. Sub-threshold leakage through the channel of an OFF transistor (I3 in Figure 13.18)
The junction leakage occurs from the source or drain to the substrate
through the reverse-biased diodes when a transistor is OFF. For instance, in
the case of an inverter with low input voltage, the NMOS is OFF, the PMOS
is ON, and the output voltage is high. Subsequently, the drain-to-substrate
voltage of the OFF NMOS transistor is equal to the supply voltage. This
results in a leakage current from the drain to the substrate through the
reverse-biased diode. The magnitude of the diode’s leakage current depends
on the area of the drain diffusion and the leakage current density, which is in
turn determined by the process technology.
The gate direct tunneling leakage flows from the gate through the “leaky” oxide insulation to the substrate. Its magnitude increases exponentially as the gate oxide thickness Tox is reduced and as the supply voltage is raised. In fact, every 0.2 nm reduction in Tox causes a tenfold increase in the gate leakage current [26]. According to the 2001 International Technology Roadmap for Semiconductors (ITRS-01) [27], a high-K gate dielectric with reduced direct tunneling current is required to control this component of the leakage current for low standby power devices.
The sub-threshold current is the drain-source current of an OFF
transistor. This is due to the diffusion current of the minority carriers in the
channel for a MOS device operating in the weak inversion mode (i.e., the
sub-threshold region.) For instance, in the case of an inverter with a low input voltage, the NMOS is turned OFF and the output voltage is high. Even when VGS is 0 V, there is still a current flowing in the channel of the OFF NMOS transistor due to the drain-to-source potential VDS. The magnitude of the sub-threshold current is a function of the temperature, supply voltage, device size, and the process parameters, out of which the threshold voltage Vth plays a dominant role.
In current CMOS technologies, the sub-threshold leakage current is much
larger than the other leakage current components [27]. IDS can be calculated by using the following formula:

    IDS = K · e^((VGS − Vth + η·VDS) / (n·vT)) · (1 − e^(−VDS / vT))

where K and n are functions of the technology, vT is the thermal voltage, and η is the drain-induced barrier lowering (DIBL) coefficient [28]. For large VDS and small η, IDS becomes independent of the VDS value.

Clearly, decreasing the threshold voltage increases the leakage current exponentially. In fact, decreasing the threshold voltage by 100 mV increases the leakage current by a factor of 10. Decreasing the length of transistors increases the leakage current as well. Therefore, in a chip, transistors that have a smaller threshold voltage and/or length due to process variation contribute more to the overall leakage. Although previously the leakage current was important only in systems with long inactive periods (e.g., pagers and networks of sensors), it has become a critical design concern in virtually any of today's designs.
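Plugging representative numbers into the sub-threshold expression illustrates this sensitivity to Vth; the constants below are placeholders rather than data for any particular process.

```python
import math

def i_sub(vgs, vth, vds, K=1e-7, n=1.5, eta=0.05, vt=0.026):
    """Sub-threshold current IDS = K * exp((Vgs - Vth + eta*Vds)/(n*vt))
    * (1 - exp(-Vds/vt)), with vt = kT/q (about 26 mV at room temp)."""
    return (K * math.exp((vgs - vth + eta * vds) / (n * vt))
              * (1.0 - math.exp(-vds / vt)))

off_leak     = i_sub(vgs=0.0, vth=0.35, vds=1.2)
off_leak_low = i_sub(vgs=0.0, vth=0.25, vds=1.2)  # Vth lowered by 100 mV
print(off_leak_low / off_leak)  # ~13x: roughly a decade per 100 mV of Vth
```

With n·vt ≈ 39 mV, the sub-threshold swing is about 90 mV per decade, so a 100 mV drop in Vth increases the OFF current by roughly an order of magnitude, consistent with the factor-of-10 rule of thumb above.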
The leakage current increases with temperature. Figure 13.19 shows the
leakage current for several technologies for different temperatures. As one
can see, the leakage current grows in each generation. Furthermore, in a
given technology, the leakage current increases with the temperature.

Figure 13.20 shows the power consumption of a 15-mm die fabricated in a technology with a supply voltage of 0.7 V. Although the leakage
power is only 6% of the total power consumption at 30°C, it becomes 56%
of the total power at 110°C. This clearly shows the necessity of using
leakage power reduction techniques in current designs.
In this section three methods for decreasing the leakage power are
described. In the first method, which is called power gating, the power of a
circuit or some of its blocks is removed. In the second method, the threshold
voltage of transistors is changed using the body bias. The third method takes
advantage of the fact that the leakage is a strong function of input values to
decrease the leakage power. All methods can be used only if the system is in
a sleep state (e.g., the STANDBY state.)

13.3.2.1 Power Gating

The most natural way of lowering the leakage power dissipation of a VLSI
circuit in the STANDBY state is to turn off its supply voltage. This can be
done by using one PMOS transistor and one NMOS transistor in series with
the transistors of each logic block to create a virtual ground and a virtual
power supply as depicted in Figure 13.21. Notice that in practice only one
transistor is necessary. Because of their lower on-resistance, NMOS
transistors are usually used.

In the ACTIVE state, the sleep transistor is on. Therefore, the circuit
functions as usual. In the STANDBY state, the transistor is turned off, which
disconnects the gate from the ground. Note that to lower the leakage, the
threshold voltage of the sleep transistor must be large. Otherwise, the sleep
transistor will have a high leakage current, which will make the power gating
less effective. Additional savings may be achieved if the width of the sleep
transistor is smaller than the combined width of the transistors in the pull-
down network. In practice, Dual CMOS or Multi-Threshold CMOS
(MTCMOS) is used for power gating [29]. In these technologies there are
several types of transistors with different values. Transistors with a low
are used to implement the logic, while devices are used as sleep
transistors.
To guarantee the proper functionality of the circuit, the sleep transistor
has to be carefully sized to decrease its voltage drop while it is on. The
voltage drop in the sleep transistor decreases the effective supply voltage of
the logic gate. Also, it increases the threshold of the pull-down transistors
due to the body effect. This increases the high-to-low transition delay of the
circuit. Using a large sleep transistor can solve this problem. On the other
hand, using a large sleep transistor increases the area overhead and the
dynamic power consumed for turning the transistor on and off. Note that
because of this dynamic power consumption, it is not possible to save power
for short idle periods. There is a minimum duration of the idle time below
which power saving is impossible. Increasing the size of the sleep transistors
increases this minimum duration.

Since using one transistor for each logic gate results in a large area and power overhead, one transistor may be shared by a group of gates, as depicted in Figure 13.22.
The size of the sleep transistor in Figure 13.22 should be larger than the one used in Figure 13.21. To find the optimum size of the sleep transistor, it is necessary to find the input vector that causes the worst-case delay in the circuit. This requires simulating the circuit under all possible input values, a task that is not possible for large circuits.
In [29], the authors describe a method to decrease the size of sleep
transistors based on the mutual exclusion principle. In their method, they
first size the sleep transistors to achieve delay degradation less than a given
percentage for each gate. Notice that this guarantees that the total delay of
the circuit will be degraded by less than the given percentage. In fact the
actual degradation can be as much as 50% smaller. The reason for this is that
sleep transistors degrade only the high-to-low transitions and at each cycle
only half of the gates switch from high to low. Now the idea is that if two
gates switch at different times (i.e., their switching windows are non-
overlapping), then their corresponding sleep transistors can be shared.

Consider the inverters in Figure 13.23. These inverters switch at different times due to their propagation delays. Therefore, it is possible to combine their sleep transistors and use one transistor instead of three. In general, if there are n logic gates whose output transition windows are non-intersecting, and each has a sleep transistor whose width is Wi, then these sleep transistors may be replaced with a single transistor whose width is W = max(Wi) for i = 1, ..., n. Notice that this will decrease the delay degradation of the logic gates whose corresponding sleep transistors are narrower than W. Furthermore, if there are several sleep transistors corresponding to some logic gates with overlapping output transition windows, then these sleep transistors may be replaced by a single transistor whose width is the sum of the individual widths.

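A coarse, all-or-nothing version of this sizing rule can be sketched as follows; real methods merge transistors pairwise based on which specific windows overlap, and the widths and switching windows below are invented.

```python
def combined_width(widths, windows):
    """Merge per-gate sleep transistors into one. If no two output
    transition windows overlap, the shared transistor only ever carries
    one gate's discharge current, so max(widths) suffices; otherwise
    the widths of the transistors must be summed."""
    def overlap(w1, w2):
        return w1[0] < w2[1] and w2[0] < w1[1]
    any_overlap = any(overlap(a, b)
                      for i, a in enumerate(windows)
                      for b in windows[i + 1:])
    return sum(widths) if any_overlap else max(widths)

# Three inverters switching one after another (disjoint windows, in ns):
print(combined_width([2.0, 3.0, 2.5], [(0, 1), (1, 2), (2, 3)]))  # 3.0
# The same gates switching simultaneously:
print(combined_width([2.0, 3.0, 2.5], [(0, 1), (0, 1), (0, 1)]))  # 7.5
```

In the disjoint case the shared transistor is less than half the total width of the three it replaces, which is the area saving the mutual exclusion principle buys.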
Using mutual exclusion at the gate level is not practical for large circuits.
To handle large circuits, the mutual exclusion principle may be used at a
larger level of granularity. In this case, a single sleep transistor is used for
each module or logic block. The size of this sleep transistor is calculated
according to the number of logic gates and complexity of the block. Next the
sleep transistors for different blocks are combined as described before. This
method enables one to “hide” the details of the blocks; thus, large circuits can be handled. However, in this case, the sizes of the sleep transistors may be sub-optimal.
Power gating is a very effective method for decreasing the leakage
power. However, it suffers from the following drawbacks:
Component-level Power Management Techniques 405

1. It requires modification in the CMOS technology process to support
both a high-threshold-voltage device (for the power switch) and a
low-threshold-voltage device (for the logic gates).
2. It decreases the voltage swing; therefore, it decreases the DC noise
margin.
3. Supply voltage scale-down makes it necessary to decrease the
threshold voltage of the sleep transistors in each generation. This
means that the leakage current will continue to increase
exponentially with each generation.
4. Scaling-down the supply voltage decreases the drive on all
transistors. As a result, the on-resistance of the transistors increases.
This increase is greater for sleep transistors because of their higher
threshold voltage. As a result, larger sleep transistors should be used,
which increases the area overhead of this approach.
5. Sleep transistor sizing is a non-trivial task and requires much effort.
6. Power gating cannot be used in sequential circuits (unless the circuit
state is first saved and subsequently restored) because turning the
supply off results in a loss of data stored in the memory elements.

13.3.2.2 Body Bias Control

One of the methods proposed for decreasing the leakage current is using
reverse-body bias to increase the threshold voltage of transistors in the
STANDBY state [30]. The threshold voltage of a transistor can be calculated
from the following standard expression,

V_T = V_T0 + γ · ( √(|2·φ_F| + V_SB) − √(|2·φ_F|) )

where V_T0 is the threshold voltage for V_SB = 0, φ_F is the substrate Fermi
potential, and the parameter γ is the body-effect coefficient [31]. As one can
see, reverse biasing a transistor increases its threshold voltage. This results in
a decrease in the leakage current of the transistor. This method requires a
triple-well technology, which may not always be available. Because the
threshold voltage changes with the square root of the reverse bias voltage, a
large voltage may be necessary to get a small increase in the threshold
voltage. As a result, this method becomes less effective as the supply voltage
is scaled down.
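As a numeric illustration of this square-root dependence (all parameter values below are assumed for illustration, not taken from the chapter):

```python
import math

# Sketch of the standard body-effect expression:
#   V_T = V_T0 + gamma * (sqrt(|2*phi_F| + V_SB) - sqrt(|2*phi_F|))
# All parameter values below are illustrative assumptions.

def vt_with_body_bias(vt0, gamma, phi_f, vsb):
    """Threshold voltage under a source-to-body reverse bias vsb (volts)."""
    return vt0 + gamma * (math.sqrt(abs(2 * phi_f) + vsb)
                          - math.sqrt(abs(2 * phi_f)))

vt0, gamma, phi_f = 0.4, 0.4, 0.3  # V, sqrt(V), V (assumed values)
for vsb in (0.0, 0.5, 1.0, 2.0):
    vt = vt_with_body_bias(vt0, gamma, phi_f, vsb)
    print(f"V_SB = {vsb:.1f} V  ->  V_T = {vt:.3f} V")
```

With these assumed parameters, raising the reverse bias from 1 V to 2 V gains less threshold shift than the first volt did, which is the diminishing-returns behavior that limits this method at scaled supply voltages.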
As stated previously, the leakage current has two main components: the
junction band-to-band tunneling leakage and the subthreshold leakage. The
second component is typically much larger than the first one. Increasing the
reverse substrate bias decreases the subthreshold leakage, but it increases the
junction band-to-band tunneling leakage exponentially. This suggests there is
an optimum substrate voltage for which
the leakage current is at a minimum. The optimum substrate voltage
decreases by a factor of two, and the leakage reduction becomes less
effective by a factor of four in each technology generation [32]. Therefore,
this method may not be effective in future technology generations.
Alternatively, it is possible to use forward-body bias to decrease the
threshold voltage [33]. In the STANDBY state, zero substrate bias is used to
have a high V_t for low leakage. To decrease the gate delays while in the
ACTIVE state, the threshold voltage is decreased by using a forward-body
bias. This method has been successfully used to reduce the leakage power of
a router chip by a factor of 3.5.
Finally, as in dynamic supply voltage scaling, it is possible to use a
dynamically varying threshold voltage. The idea is to adjust the V_t of
devices by using dynamic body bias control. It is shown in [34] that the
leakage current of an MPEG-4 chip can be driven below 10 mA in the
ACTIVE state and to a much smaller value in the STANDBY state,
independently of the default V_t of the devices and the temperature.

13.3.2.3 Minimum Leakage Vector Method

The leakage current of a logic gate is a strong function of its input values.
The reason is that the input values affect the number of OFF transistors in
the NMOS and PMOS networks of a logic gate.
Table 13.2 shows the leakage current of a two input NAND gate built in
a CMOS technology with a 0.2V threshold voltage and a 1.5V
supply voltage. Input A is the one closer to the output of the gate.

The minimum leakage current of the gate corresponds to the case when
both its inputs are zero. In this case, both NMOS transistors in the NMOS
network are off, while both PMOS transistors are on. The effective
resistance between the supply and the ground is the resistance of two OFF
NMOS transistors in series. This is the maximum possible resistance. If one
of the inputs is zero and the other is one, the effective resistance will be the
same as the resistance of one OFF NMOS transistor. This is clearly smaller
than the previous case. If both inputs are one, both NMOS transistors will be
on. On the other hand, the PMOS transistors will be off. The effective
resistance in this case is the resistance of two OFF PMOS transistors in
parallel. Clearly, this resistance is smaller than the other cases.
In the NAND gate of Table 13.2 the maximum leakage is about three
times higher than the minimum leakage. Note that there is a small difference
between the leakage current of the A=0, B=1 vector and the A=1, B=0
vector. The reasons are the difference in the size of the NMOS transistors
and the body effect. This data in fact describes the “stack effect,” i.e., the
phenomenon whereby the leakage current through a stack of two or more
OFF transistors is significantly smaller than a single device leakage.
Other logic gates exhibit a similar leakage current behavior with respect
to the applied input pattern. As a result, the leakage current of a circuit is a
strong function of its input values. Abdollahi et al. [35] use this fact to
reduce leakage current. They formulate the problem of finding the minimum
leakage vector (MLV) using a series of Boolean Satisfiability problems.
Using this vector to drive the circuit while in the STANDBY state, they
reduce the circuit leakage. It is possible to achieve a moderate reduction in
leakage using this technique, but the reduction is not as high as the one
achieved by the power gating method. On the other hand, the MLV method
does not suffer from many of the shortcomings of the other methods. In
particular,
1. No modification in the process technology is required.
2. No change in the internal logic gates of the circuit is necessary.
3. There is no reduction in voltage swing.
4. Technology scaling does not have a negative effect on its
effectiveness or its overhead. In fact the stack effect becomes
stronger with technology scaling as DIBL worsens.
The first three facts make it very easy to use this method in existing designs.
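A brute-force version of the MLV search makes the idea concrete. The chapter's formulation uses Boolean Satisfiability to scale to real circuits; the sketch below simply enumerates all input vectors of a two-gate toy circuit, using assumed (illustrative) per-state leakage values in nA:

```python
from itertools import product

# Assumed per-state leakage of a 2-input NAND, in nA. These are
# illustrative values only; they mimic the stack effect, in which two
# series-OFF NMOS devices leak far less than a single OFF device.
NAND2_LEAK = {(0, 0): 5, (0, 1): 15, (1, 0): 17, (1, 1): 45}

def circuit_leakage(a, b, c):
    """Toy circuit: n1 = NAND(a, b); out = NAND(n1, c)."""
    n1 = 1 - (a & b)
    return NAND2_LEAK[(a, b)] + NAND2_LEAK[(n1, c)]

best = min(product((0, 1), repeat=3), key=lambda v: circuit_leakage(*v))
print("MLV:", best, "-> leakage:", circuit_leakage(*best), "nA")
```

Note that the internal node n1 is forced to 1 whenever a·b = 0, so even the best primary-input vector cannot put the second gate into its lowest-leakage state; this is exactly the logic-dependency limitation that motivates controlling internal signals directly.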
Further reduction in leakage may be achieved by modifying the internal
logic gates of a circuit. Note that due to logic dependencies of the internal
signals, driving a circuit with its MLV does not guarantee that the leakage
currents of all its logic gates are at minimum values. Therefore, when in the
STANDBY state, if, by some means, values of the internal signals are also
controlled, even higher leakage savings can be achieved. One way to control
the value of an internal signal (line) of a circuit is to replace the line with a
2-to-1 multiplexer [36]. The multiplexer is controlled by the SLEEP signal
whereas its data inputs are the incoming signal and either a ZERO or ONE
value decided by the leakage current minimization algorithm. The output is
the outgoing signal. Since one input of the multiplexer is a constant value,
the multiplexer can be replaced by an AND or an OR gate. Figure 13.24
shows a small circuit and its modified version where the internal signal line
can explicitly be controlled during the STANDBY state.

In Figure 13.24(b), when the circuit is in STANDBY state, the output of
the AND gate is ZERO; if a ONE on that line is desired, the AND gate has
to be replaced by an OR gate. Note that extra gates added to the circuit
consume leakage power. Therefore, replacing all internal lines with
multiplexers or gates will increase the leakage. The problem of determining
which lines to replace and also finding the MLV for primary inputs and the
selected internal signals can be formulated using a series of Boolean
Satisfiability problems and solved accordingly as shown in [36].
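Functionally, the inserted AND or OR gate behaves as the 2-to-1 multiplexer it replaces. A minimal, illustrative sketch (with SLEEP = 1 denoting STANDBY):

```python
# Sketch of the line-forcing transform: in STANDBY the inserted gate pins
# the internal line to a chosen constant; in ACTIVE mode it is transparent.

def forced_line(incoming, sleep, standby_value):
    if standby_value == 0:
        return incoming & (1 - sleep)   # AND with SLEEP': forces 0 in standby
    return incoming | sleep             # OR with SLEEP: forces 1 in standby

# ACTIVE (sleep=0): the line follows the incoming signal.
assert [forced_line(s, 0, 1) for s in (0, 1)] == [0, 1]
# STANDBY (sleep=1): the line is pinned regardless of the incoming signal.
assert [forced_line(s, 1, 0) for s in (0, 1)] == [0, 0]
assert [forced_line(s, 1, 1) for s in (0, 1)] == [1, 1]
print("behaves as a 2-to-1 mux whose constant input is the standby value")
```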
Summary 409

Another way of controlling the value of the internal signals of a circuit is
modifying its gates. Figure 13.25 shows two ways of modifying a CMOS
gate. In both cases a transistor is added in series with one of the N or P
networks. This decreases the gate leakage because of the transistor stack
effect. The percentage of the reduction depends on the type of the gate. In
addition, as mentioned before, this modification makes it possible to control
the values of the internal lines in the circuit thus decreasing the leakage
current of the gates in the fanout of the modified gate.
Clearly, adding transistors to gates increases the delay of the circuit. The
problem of finding the minimum leakage vector and the optimal set of gates
to be modified in order to minimize the leakage of the circuit under a delay
constraint can be formulated as a series of Boolean Satisfiability problems
and solved accordingly [36].

13.4 SUMMARY

This chapter introduced the concept of dynamic power management to
selectively turn off or reduce the performance of some blocks of a system so
as to decrease the system-level power consumption. Several policies for
dynamic power management were described. It was also described how
blocks of a system could be modified to make them power manageable.
More specifically, several techniques for reducing dynamic power and
leakage power were introduced. The effectiveness of many of these
techniques diminishes as the technology scales down. Therefore, it becomes
necessary to find new techniques as we enter the ultra deep sub-micron era.
Furthermore, introduction of new applications may invalidate many
assumptions made while designing dynamic power-management policies and
will make it necessary to find new policies. Dynamic power management is
currently in its infancy and will most likely remain an active area of research
in the foreseeable future.

ACKNOWLEDGEMENT

The authors would like to thank Afshin Abdollahi, Kihwan Choi, Chang-
woo Kang, Peng Rong, and Qing Wu for their contributions to this chapter.

REFERENCES
[1] Intel, Microsoft, Toshiba, Advanced configuration and power interface specification,
http://www.acpi.info/.
[2] IBM, "2.5-Inch Travelstar Hard Disk Drive," 1998.
[3] M. Srivastava, A. P. Chandrakasan, R. W. Brodersen, “Predictive system shutdown and
other architectural techniques for energy efficient programmable computation,” IEEE
Trans. on VLSI Systems, vol. 4, no. 1, pp. 42-55, 1996.
[4] C. Hwang, A. C.-H. Wu, “A predictive system shutdown method for energy saving of
event-driven computation,” Proc. International Conference on Computer-Aided Design
of Integrated Circuits and Systems, Vol. 16, pp. 28-32, November 1997.
[5] L. Benini, A. Bogliolo, G. A. Paleologo and G. De Micheli, “Policy optimization for
dynamic power management,” IEEE Trans. on Computer-Aided Design of Integrated
Circuits and Systems, Vol. 18, pp. 813-833, June 1999.
[6] E. Chung, L. Benini and G. De Micheli, “Dynamic power management for nonstationary
service requests,” Proc. Design and Test in Europe Conference, March 1999,
pp. 77-81.
[7] Q. Qiu, Q. Wu and M. Pedram, “Stochastic modeling of a power-managed system-
construction and optimization,” IEEE Transactions on Computer-Aided Design of
Integrated Circuits and Systems, pp. 1200-1217, October 2001.
[8] T. Simunic, L. Benini, P. Glynn, and G. De Micheli, “Event-driven power management,” IEEE
Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 20,
pp. 840-857, July 2001.
[9] Q. Wu, Q. Qiu, and M. Pedram, “Dynamic power management of complex systems
using generalized stochastic Petri nets,” Proc. Design Automation Conference, June
2000, pp. 352-356.
[10] Q. Qiu, Q. Wu and M. Pedram, “Dynamic power management in a mobile multimedia
system with guaranteed quality-of-service,” Proc. Design Automation Conference,
June 2001, pp. 834-839.
[11] M. Pedram and Q. Wu, "Design considerations for battery-powered electronics," Proc.
Design Automation Conference, June 1999, pp. 861-866.
[12] Thomas F. Fuller, Marc Doyle and John Newman, "Relaxation phenomena in lithium-
ion-insertion cells," Journal of Electrochemical Society, Vol. 141, April 1994.
[13] P. Rong and M. Pedram, “Battery-aware power management based on CTMDPs,”
Technical Report, Department of Electrical Engineering, University of Southern
California, No. 02-06, May 2002.
[14] N. Raghavan, V. Akella and S. Bakshi, “Automatic insertion of gated clocks at register
transfer level,” Proc. 12th International Conference on VLSI Design, January 1999.
[15] M. Weiser, B. Welch, A. Demers, and S. Shenker, “Scheduling for reduced CPU
energy,” in Proc. First Symposium on Operating Systems Design Implementation, 1994,
pp. 13-23.
[16] K. Govil, E. Chan, and H. Wasserman, “Comparing algorithms for dynamic speed-
setting of a low power CPU,” Proc. First International Conference on Mobile
Computing Networking, 1995, pp. 13-25.
[17] A. Chandrakasan, V. Gutnik, and T. Xanthopoulos, “Data driven signal processing: an
approach to energy efficient computing,” Proc. International Symposium on Low Power
Electronics and Design, August 1996, pp.347-352.
[18] D. Shin, J. Kim, and S. Lee, “Low-energy intra-task voltage scheduling using static
timing analysis,” Proc. Design Automation Conference, June 2001, pp. 438-443.
[19] S. Lee and T. Sakurai, “Run-time power control scheme using software feedback loop
for low-power real-time applications,” Proc. Asia South-Pacific Design Automation
Conference, January 2000, pp. 381-386.
[20] B. Razavi, RF Microelectronics, Prentice Hall, 1997.
[21] O. Y-H. Leung, C-W. Yue, C-Y. Tsui, and R. S. Cheng, “Reducing power consumption of
turbo code decoder using adaptive iteration with variable supply voltage,” Proc.
International Symposium on Low Power Electronics and Design, August 1999, pp. 36-
41.
[22] F. Gilbert, A. Worm, N. Wehn, “Low power implementation of a turbo-decoder on
programmable architectures,” Proc. Asia South-Pacific Design Automation Conference,
January 2001, pp. 400-403.
[23] T. Pering, T. Burd, and R. Brodersen, “The simulation and evaluation of dynamic
voltage scaling algorithms,” Proc. International Symposium on Low Power Electronics
and Design, August 1998, pp.76-81.
[24] K. Choi, K. Dantu and M. Pedram, “Frame-based dynamic voltage and frequency scaling
for a MPEG decoder,” Technical Report, Department of Electrical Engineering,
University of Southern California, No. 02-07, May 2002.
[25] M. Alidina, J. Monteiro, S. Devadas, A. Ghosh, and M. Papaefthymiou,
“Precomputation-Based Sequential Logic Optimization for Low Power,” Proc.
International Conference on Computer-Aided Design, November 1994, pp. 74-81.
[26] C-F. Yeap, “Leakage current in low standby power and high performance devices: trends
and challenges,” Proc. International Symposium on Physical Design, April 2002, pp. 22-
27.
[27] Semiconductor Industry Association, International Technology Roadmap for
Semiconductors, 2001 edition, http://public.itrs.net/.
[28] B. Sheu, D. Scharfetter, P. Ko, and M. Jeng, "BSIM: Berkeley short-channel IGFET
model for MOS transistors," IEEE Journal of Solid State Circuits, Vol. 22, August 1987,
pp. 558-566.
[29] J. T. Kao, A. P. Chandrakasan, "Dual-threshold voltage techniques for low-power digital
circuits,” IEEE Journal of Solid-State Circuits, Vol. 35, July 2000, pp. 1009-1018.
[30] K. Seta, H. Hara, T. Kuroda, et al., “50% active-power saving without speed degradation
using standby power reduction (SPR) circuit,” IEEE International Solid-State Circuits
Conf., February 1995, pp. 318-319.
[31] S-M. Kang and Y. Leblebici, CMOS Digital Integrated Circuits, McGraw-Hill, second
edition, 1999.
[32] A. Keshavarzi, S. Narendra, S. Borkar, V. De, and K. Roy, “Technology scaling
behavior of optimum reverse body bias for standby leakage power reduction in CMOS
IC's,” Proc. International Symposium on Low Power Electronics and Design, August
1999, pp. 252-254.
[33] V. De and S. Borkar, “Low power and high performance design challenges in future
technologies,” Proc. Great Lakes Symposium on VLSI, 2000, pp. 1-6.
[34] T. Kuroda, T. Fujita, F. Hatori, and T. Sakurai, “Variable threshold-voltage CMOS
technology,” IEICE Transactions on Fundamentals of Electronics, Communications and
Computer Sciences, vol. E83-C, November 2000, pp. 1705-1715.
[35] A. Abdollahi, F. Fallah, M. Pedram, “Minimizing leakage current in VLSI circuits,”
Technical Report, Department of Electrical Engineering, University of Southern
California, No. 02-08, May 2002.
[36] A. Abdollahi, F. Fallah, M. Pedram, “Runtime mechanisms for leakage current reduction
in CMOS VLSI circuits,” Proc. International Symposium on Low Power Electronics and
Design, August 2002.
Chapter 14
Tools and Methodologies for Power Sensitive Design

Jerry Frenkil
Sequence Design, Inc.

Abstract: The development of power efficient devices has become increasingly
important in a wide variety of applications, while at the same time technology
advances following Moore’s Law have led to faster and more complex circuits
which consume ever increasing amounts of power. While power consumption
issues have historically been treated by lowering the supply voltage, the issues
have become sufficiently challenging and complex as to require much more
design attention and significant amounts of design automation. This chapter
discusses the various types of design automation available to address power
consumption issues, and presents a comprehensive design flow that
incorporates multiple levels and types of power sensitive design automation.

Key words: Low power design, power analysis, power estimation, power optimization,
computer aided design, power sensitive design, power modeling, power tools.

14.1 INTRODUCTION

The demand for battery-powered products is creating immense interest in
energy efficient design. Meanwhile, integrated-circuit densities and
operating speeds have continued to climb, following Moore’s Law in
unabated fashion. The result is that chips are becoming larger, faster, and
more complex and because of this, consuming ever-increasing amounts of
power.
These increases in power pose new and difficult challenges for integrated
circuit designers. While the initial response to increasing levels of power
consumption was to reduce the supply voltage, it quickly became apparent
that this approach was insufficient. Designers subsequently began to focus
414 Tools and Methodologies for Power Sensitive Design

on advanced design tools and methodologies to address the myriad power
issues, many of which had previously been second order effects but have
now become first order design concerns. The list of these issues is now
lengthy: power supply sizing, junction temperature calculation,
electromigration sensitivity calculation, power grid sizing and analysis,
package selection, noise margin calculation, timing derating, and macro-
model generation, to name a few.
Complicating designers’ attempts to deal with these issues are the
complexities – logical, physical, and electrical – of contemporary IC designs
and the design flows required to build them. Multi-million gate designs are
now common, incorporating embedded processors, DSP engines, numerous
memory structures, and complex clocking schemes. Multiple supply
voltages, high speed clocks, and sophisticated signalling result in a complex
electrical environment, which in turn requires substantial amounts of
automation to support accurate design and analysis efforts. In addition,
different measurement types must be supported to address a variety of power
data applications.
This chapter discusses the various types of design automation that focus
on Power Sensitive Design and should be of interest to designers of power
efficient devices, IC design engineering managers, and EDA managers and
engineers. The chapter begins with a design automation view of power: an
overview of power issues requiring automation, power modeling, power
views at different abstraction levels, and examples of industry standards.
Next, a survey is presented, by abstraction level, of the different types of
power tools for different applications, including representative examples of
commercially available tools. Following the survey, a Power Sensitive
Design flow is presented illustrating the use of the tools previously
described. The chapter concludes with a look toward the future: looming
issues and the likely design automation solutions for those issues.

14.2 THE DESIGN AUTOMATION VIEW

Power Sensitive Design Automation seeks to add power as a third dimension
to the other two dimensions of IC design, performance and area. Power
Sensitive Design Automation takes many forms such as analysis,
optimization, and modeling and can be applied at many different points in
the design process [1]. But whatever the form and whenever used, the
overall objective is to assist designers confronting power related issues in
much the same manner as they address timing and area concerns.
The Design Automation View 415

14.2.1 Power Consumption Components

In order to calculate the power consumption¹³ of integrated circuits, certain
types of information are required: physical data (such as wiring capacitances
and transistor descriptions), activity information (such as which nodes toggle
and how often), and electrical data (such as power supply voltages and
current flows). This data is used to calculate power consumption according
to the following equation.

P = V·I + C·V²·f (14.1)

Here P represents the total power consumed, V represents the supply
voltage, I denotes the current drawn from the supply, C represents
capacitance, and f represents the operating frequency. The V·I term
represents the static, or DC, power consumption, while the C·V²·f term
represents the dynamic power consumption.
With regard to CMOS integrated circuit design, this equation expands
into the following.

P = C_L·V_swing·V_dd·f + Q_sc·V_dd·f + I_leak·V_dd + I_static·V_dd (14.2)

Here C_L represents the total nodal capacitance (including output driver
capacitance, wire capacitance, and fanout capacitance), f represents the
switching frequency, V_dd represents the supply voltage, V_swing represents
the signal voltage swing (which for CMOS is usually equal to V_dd), Q_sc
represents the charge due to the short-circuit momentary current (also known
as crow-bar current) drawn from the supply during switching events, I_leak
represents the parasitic leakage current, and I_static represents the (by design)
quiescent static current. The first two terms of this equation represent the
dynamic power consumption, while the latter two represent the static power
consumption.

Note the structure of the Q_sc·V_dd·f term. This power component is often
represented as I_sc·V_dd; however, such a formulation does not reflect the fact
that this is not a continuously flowing current but rather a momentary, or
pulsed, current and thus should not have a DC form. Here f represents the
frequency or rate at which it is pulsed.
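As a numeric illustration of the per-element calculation of equation (14.2), the sketch below uses assumed parameter values (not data from the chapter):

```python
# Sketch of the per-element power of equation (14.2). Q_sc models the
# short-circuit charge drawn per switching event, so that term scales
# with f rather than appearing as a DC V*I product.

def element_power(c_load, v_dd, v_swing, f, q_sc, i_leak, i_static):
    dynamic = c_load * v_swing * v_dd * f + q_sc * v_dd * f
    static = (i_leak + i_static) * v_dd
    return dynamic + static

# Assumed values: 50 fF load, 1.5 V rail-to-rail swing, 100 MHz switching,
# 5 fC short-circuit charge per event, 1 nA leakage, no designed DC current.
p = element_power(50e-15, 1.5, 1.5, 100e6, 5e-15, 1e-9, 0.0)
print(f"element power = {p * 1e6:.4f} uW")
```

With these numbers the dynamic terms dominate; the 1 nA leakage contributes only about 1.5 nW, which is why leakage was long treated as a second order effect.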

¹³ It should be noted that power consumption and power dissipation are not synonymous. For
a more detailed discussion of the differences, please refer to the section on power
measurement types. All discussions of power in this chapter will refer to power
consumption. It should also be noted that static CMOS is the target semiconductor
technology and circuit topology for the calculations and tools described in this chapter.

Consider the inverter shown in Figure 14.1. The load capacitance, C_L, is
charged through the p-channel pull-up; this is represented by the
C_L·V_swing·V_dd·f term, where the f represents the rate at which the load is
charged. The short-circuit current flows whenever the inverter changes state
and is the result of both the n-channel pull-down and the p-channel pull-up
being in the on-state momentarily.

Examples of the static current I_static and the leakage current I_leak are
shown in Figure 14.2. Here I_static represents the current that flows through
the circuit by design – in this case I_static is the bias current of the current
source. Examples of circuits in which through currents occur are analog
circuits such as comparators, PLLs, and sense-amplifiers. By contrast, I_leak
represents the parasitic leakage currents that flow from the power supply to
ground, even when the transistor is in cutoff. While there are many
components to the leakage current [2], for modeling purposes they can
generally all be abstracted to a single current represented by I_leak.
Equation (14.2) effectively represents the power consumed by a single
element, such as an inverter, driving a capacitive load. But ICs contain
millions of elements and nodes, and equation (14.2) must be used to
calculate power for each of those.

P = Σ_i ( C_L,i·V_swing,i·V_dd·f_i + Q_sc,i·V_dd·f_i + I_leak,i·V_dd + I_static,i·V_dd ) (14.3)

Thus, equation (14.3) results when the calculation represented by
equation (14.2) is summed over all the elements in the design.
In order to understand power consumption at a full chip level and given
the large number of elements in recent designs, it is often helpful to classify
the various components of power consumption by type. A taxonomy of
power consumption, illustrated in Figure 14.3, is often employed by power
tools when reporting the power consumption details for a particular design.

14.2.2 Different Types of Power Tools

Power tools can generally be classified along two axes: function and
abstraction level. Function refers to the expected operation of the tool, such
as analysis, modeling, or optimization, while abstraction level refers to the
level of detail present in the input design description.

Along the function axis, the most fundamental tool is the analysis tool.
This type of tool estimates or calculates the amount of power consumed by
the design of interest. An analysis tool may be used alone or it may be used
as the internal calculation engine for other types of power tools, such as
optimizers, modelers, or other derivative power tools. For example, an
optimizer takes a design and, while holding functionality and performance
constant, makes various transformations to reduce power. In most cases,
optimizers need an internal analysis engine in order to judge whether or not a
given transformation actually reduces power. Modelers utilize an analysis
engine internally in order to compute a circuit’s power characteristics to
produce a power model for a higher abstraction level. The fourth category is
derivatives. These types of tools target the effects of power on other
parameters, such as the current flow through power distribution networks or
the effects of power on circuit timings.
Each of these functions may be performed on different design
abstractions such as transistor, gate (or logic), register transfer (or RTL),
behavior, or system. Here abstraction refers specifically to the level at
which the design is described. For example, a netlist of interconnected
transistors is at the transistor level while a netlist of interconnected logic
primitives, such as NAND gates, flip-flops, and multiplexors is at the gate
level. A design at the RT-level is written in a hardware description language
such as Verilog or VHDL with the register storage explicitly specified, thus
functionality is defined per clock cycle. A behavioral description may also
be written in Verilog, VHDL, or even C, but in this case the abstraction is
“higher” as register storage is implied and functionality may be specified
across clock cycles, or without any reference to a clock cycle at all. The
highest abstraction level is the system level. At this level many of the details
are omitted but the functionality is described by the interrelationship of a
number of high-level blocks or functions.

14.2.3 Power Tool Data Requirements

At whatever abstraction level the tool operates, the data input requirements
are generally the same, although the forms of the data will vary. Equation
(14.3) shows that both technology-related information such as capacitances
and currents and environmental information such as voltages and activities
are required, in addition to the design itself. However, the form of the data,
and which data is input and which data is derived, varies according to how
the specific tools operate.

14.2.3.1 Design Data

Design data is represented in one of several different standard formats. For
transistor level designs, the data is represented as an interconnection of
transistors in a SPICE, or SPICE-like interconnection format [3]. Gate level
designs are similarly represented as an interconnection of logic model
primitives, with Verilog being the most prevalent format.
Higher-level descriptions most often are also represented as an
interconnection of logic functions, but in these cases the functions are much
more complex and abstract. These designs are also often represented in
Verilog, but VHDL, C, and C++ are also used, especially at the behavior and
system levels.

14.2.3.2 Environmental Data

A given design will consume differing amounts of power depending upon its
environment. For example, a particular microprocessor running at 100 MHz
in one system will consume less power than the same microprocessor
running in a different system at 150 MHz.
Environmental data can be grouped into three major categories: voltage
supplies, activities, and external loads. Data in each of these categories must
be specified in order to accurately calculate a design’s power consumption.
Supply voltage is represented by the V_dd term in equation (14.3) and is
usually specified as a global parameter. Some designs may utilize multiple
supply voltages, in which case a different value of V_dd must be assigned to
each section as appropriate.
The capacitive loads of a design’s primary outputs, represented by the C_L
term in equation (14.3), must also be specified. As the values for these loads
can be rather large, often in the range of 20 to 100 pF, these capacitances can
contribute a substantial amount to a design’s total power consumption.
Activity data is represented by the f in equation (14.3). For transistor
level tools, the activities of the circuit’s primary inputs are specified as
waveforms with particular shapes, timings, and voltages. The activities of
intermediate nodes are derived by the tools as a function of the circuit’s
operation, with the power calculations being performed concurrently.
Higher-level tools also require the specification of the activities on the
primary inputs, but only the frequency or toggle counts are required.
However, in addition these tools require the activities on all the other nodes
in the design, and this data is usually generated by a separate logic
simulation of the design. For these higher-level power tools, the power is
calculated by post-processing the nodal activity data that was previously
generated by a functional or logical simulation. Figure 14.4 below
illustrates the generation of this data.
Logic simulators produce value change dump files (also known as .vcd
files) that contain data describing when each monitored node transitions and
in which direction it transitions (from low to high or high to low, for
example). The dump file containing this data is produced via the use of
standard simulator dump routines or, in some cases, by more efficient
customized simulator PLI calls. The activity files, by contrast, are reduced
versions of the .vcd files; the reduction produces, for each node, a count of
how many transitions occurred and the effective aggregate duty cycle of
those transitions. An example of an entry in an activity file would be as
follows:

Top.m1.u3 0.4758 15 14

Here, the first field contains the node name, the second field contains the
effective duty cycle, the third field contains the number of rising transitions,
and the fourth field contains the number of falling transitions.

The primary motivation for using activity data instead of .vcd data is that of
file size; .vcd files can easily require gigabytes of storage. On the other
hand, activity files are less useful in calculating instantaneous currents since
they do not maintain the temporal relationships between signals as is done
with .vcd data. While all common HDL simulators produce .vcd files
directly, few produce activity files directly, in which case the .vcd data must
The Design Automation View 421

be converted into activity data format either through the use of external
utilities or power tools’ internal conversion routines.

14.2.3.3 Technology Data & Power Models

The remaining parameters in equation (14.3) are all considered to be technology or power data.
For transistor level tools, these parameters are derived from the transistor
definitions and process parameters while the simulation is taking place. For
the higher-level tools, this information is obtained from previously generated
models. The models may be generated by manual or automatic means,
but in either case the raw data is usually generated by transistor level tools
(this process will be covered in a subsequent section) and abstracted into a
black box power model, known as a power macro-model.
It is important to note that the existence of complete and accurate macro-
models forms the foundation of a cell-based design methodology; this is true
in both the timing and power domains. Incomplete or limited accuracy
models limit the overall accuracy and efficacy of the entire design flow as
they, in essence, would be the weak link in the design automation chain.
A typical power macro-model will contain the following information:
cell name, functional definition, pin names, pin direction, pin capacitance,
energy per event, and power (or current) per state. As a simple example,
consider a standard CMOS inverter. This cell has two pins (I, ZN), two
different states (input=0, input=1), and two different energy-consuming
events (input=rising, input=falling). Thus a generic model for this inverter
might look like this:

cell {
Name: INV
Function: ZN = !I
Pin { Name = I; Direction = in; Capacitance = cap in F }
Pin { Name = ZN; Direction = out; Capacitance = cap in F}
Power_Event { Condition = 01 I; Energy = energy in J }
Power_Event { Condition = 10 I; Energy = energy in J }
Power_state { Condition = 0 I; Power = power in W }
Power_state { Condition = 1 I; Power = power in W }
}
14
The load capacitance in its full composition is partly technology data and partly design data (except for the case of a primary chip output, as described above, when the total load is off-chip – in this case it is considered to be environmental data). Consider the fanout capacitance for a given net: it would be determined by the sum of the fanout input capacitances. The fanout number comes from design data while the amount of input capacitance per fanout is considered to be technology data.

In a macro-model such as this, the cell is viewed as a black-boxed
primitive with a complete description of its power-consuming
characteristics. Each of its inputs has a capacitance associated with it so that
a tool can calculate the capacitive load presented to the cells that drive it.
Internal currents are all encapsulated in the
Power_Event and Power_State definitions. Other information may also be
included such as cell size, noise margins, and timing information.
Also optional are the dependencies of a cell’s power consumption
characteristics on variables such as input signal transition times, output
loading, power supply voltage, process corner, and temperature. Multi-
dimensional tables are normally used to represent these dependencies;
however, to minimize model complexity only two or three dimensions are
commonly used, with input transition time and output loading being the most
common dependencies considered. For the example inverter above, the
Power_Event for the input I changing from 1 to 0 might be represented as
follows:

Power_Event { { Condition = 10 I }
{Input_trans_time = 3.00e-02 4.00e-01 1.50e+00 3.00e+00 }
{Output_cap = 3.50e-04 3.85e-02 1.47e-01 3.11e-01 }
{ Energy = 1.11e-02 1.92e-02 4.59e-02 8.23e-02
7.31e-02 7.70e-02 9.60e-02 1.28e-01
2.45e-01 2.48e-01 2.59e-01 2.82e-01
5.06e-01 5.09e-01 5.16e-01 5.33e-01 }
}

Here the energy is represented by a 4x4 two-dimensional table for input transition time and output load. Each line represents four different input
transition times with given load, while each of the four different lines
represents four different loads. Table sizes vary from library to library.
Larger sizes generally produce more accurate results since less interpolation
is needed between table entries, but it comes at the expense of increased
characterization time. Typical table sizes range from 3x3 to 10x10 for two-
dimensional tables.
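When a tool must evaluate a cell at conditions that fall between characterized points, it interpolates into the table. A minimal bilinear-interpolation sketch over the example 4x4 table above might look like the following; the function name and table layout (rows indexed by load, columns by transition time) are illustrative assumptions.

```python
import bisect

def interp2(xs, ys, table, x, y):
    """Bilinear interpolation. xs: transition-time axis, ys: load axis,
    table[j][i]: energy at load ys[j] and transition time xs[i]."""
    i = min(max(bisect.bisect_right(xs, x) - 1, 0), len(xs) - 2)
    j = min(max(bisect.bisect_right(ys, y) - 1, 0), len(ys) - 2)
    tx = (x - xs[i]) / (xs[i + 1] - xs[i])
    ty = (y - ys[j]) / (ys[j + 1] - ys[j])
    e00, e01 = table[j][i], table[j][i + 1]
    e10, e11 = table[j + 1][i], table[j + 1][i + 1]
    return (e00 * (1 - tx) * (1 - ty) + e01 * tx * (1 - ty)
            + e10 * (1 - tx) * ty + e11 * tx * ty)

trans = [3.00e-02, 4.00e-01, 1.50e+00, 3.00e+00]   # input transition times
loads = [3.50e-04, 3.85e-02, 1.47e-01, 3.11e-01]   # output capacitances
energy = [[1.11e-02, 1.92e-02, 4.59e-02, 8.23e-02],
          [7.31e-02, 7.70e-02, 9.60e-02, 1.28e-01],
          [2.45e-01, 2.48e-01, 2.59e-01, 2.82e-01],
          [5.06e-01, 5.09e-01, 5.16e-01, 5.33e-01]]

# Querying exactly at a characterized point returns that table entry:
print(interp2(trans, loads, energy, 4.00e-01, 3.85e-02))
```

Larger tables shrink the interpolation error at the cost of more characterization runs, which is the trade-off noted above.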
Standard cell libraries will contain a power model such as this for each
one of the cells in the library. Each cell’s model will typically have power
consuming events defined for all possible logical events the cell might
experience. Similarly, all possible states will have static currents defined
although some libraries model only dynamic power.
Cells more complex than typical standard cells pose modeling challenges
since it is computationally too expensive to explicitly characterize and model
all possible logical events and states. Embedded memories and
microprocessors are two common examples, although even a simple cell
such as a four bit full adder (with 512 states) poses challenges. In these
cases, simplified models are utilized that define only the most significant or
simplest views of power consuming events. For example, memories are
often modeled by defining the power consumed by only a handful of events,
such as a Read access or a Write access. For microprocessors and related
logic structures, usually only a single energy value for a clock transition is
defined. While better than no model at all, such a simplistic model limits
estimation accuracy as well as the number and type of optimization
opportunities.

14.2.3.4 Modeling Standards

Several standard modeling languages exist for defining power models: Liberty [4], ALF [5], and OLA (also known as IEEE standard 1481) [6]. For
simple cells, each language is usually sufficient; however for more complex
cells or more detailed modeling, the differences are significant.
Liberty and ALF are file-based approaches and similar in structure while
different in syntax. A complete15 example of a 2 input NAND gate power
model in ALF is presented below.

CELL ND2X1 {
AREA = 9.98e+00;
PIN A { DIRECTION = input ; CAPACITANCE = 4.04e-03; }
PIN B { DIRECTION = input ; CAPACITANCE = 3.84e-03; }
PIN Y { DIRECTION = output ; CAPACITANCE = 0.00e+00; }
FUNCTION { BEHAVIOR { Y = (!(A&&B)); } }

VECTOR ( 10A -> 01Y ) {
ENERGY { UNIT = 1.0e-12 ;
HEADER {
SLEWRATE { PIN = A ;
TABLE { 3.00e-02 4.00e-01 1.50e+00 3.00e+00 } }
CAPACITANCE { PIN = Y ;
TABLE { 3.50e-04 3.85e-02 1.47e-01 3.12e-01 } }
}
TABLE { 1.11e-02 1.92e-02 4.59e-02 8.23e-02

15
This model is complete in the sense that it models all significant dynamic and static power
consuming events. However, there are four usually insignificant non-zero dynamic power
consuming events that are not represented: rising and falling transitions on each of the two
inputs that do not result in a change on the output (the other input is in the low state). Also
not shown are timing or noise data, which would be needed for timing and noise margin
calculations and power vs. performance optimizations.
7.31e-02 7.70e-02 9.60e-02 1.28e-01
2.45e-01 2.48e-01 2.59e-01 2.83e-01
5.06e-01 5.09e-01 5.16e-01 5.33e-01 }
}
}
VECTOR ( 01A -> 10Y ) {
ENERGY { UNIT = 1.0e-12 ;
HEADER {
SLEWRATE { PIN = A ;
TABLE { 3.00e-02 4.00e-01 1.50e+00 3.00e+00 } }
CAPACITANCE { PIN = Y ;
TABLE { 3.50e-04 3.85e-02 1.47e-01 3.12e-01 } }
}
TABLE { 5.98e-04 6.27e-03 3.22e-02 6.85e-02
6.24e-02 6.44e-02 8.07e-02 1.10e-01
2.38e-01 2.39e-01 2.48e-01 2.68e-01
5.04e-01 5.05e-01 5.11e-01 5.24e-01 }
}
}
VECTOR ( 10B -> 01Y ) {
ENERGY { UNIT = 1.0e-12 ;
HEADER {
SLEWRATE { PIN = B ;
TABLE { 3.00e-02 4.00e-01 1.50e+00 3.00e+00 } }
CAPACITANCE { PIN = Y ;
TABLE { 3.50e-04 3.85e-02 1.47e-01 3.12e-01 } }
}
TABLE { 1.53e-02 2.31e-02 5.05e-02 8.78e-02
7.65e-02 8.08e-02 1.01e-01 1.35e-01
2.49e-01 2.52e-01 2.64e-01 2.89e-01
5.10e-01 5.12e-01 5.20e-01 5.38e-01 }
}
}
VECTOR ( 01B -> 10Y ) {
ENERGY { UNIT = 1.0e-12 ;
HEADER {
SLEWRATE { PIN = B ;
TABLE { 3.00e-02 4.00e-01 1.50e+00 3.00e+00 } }
CAPACITANCE { PIN = Y ;
TABLE { 3.50e-04 3.85e-02 1.47e-01 3.12e-01 } }
}
TABLE { 1.27e-03 5.37e-03 3.17e-02 6.88e-02
6.29e-02 6.42e-02 8.05e-02 1.11e-01
2.38e-01 2.39e-01 2.48e-01 2.68e-01
5.05e-01 5.05e-01 5.11e-01 5.25e-01 }
}
}
VECTOR ( !A && !B ) { POWER = 3607.79 {UNIT = 1.0e-12;} }
VECTOR ( !A && B ) { POWER = 3643.09 {UNIT = 1.0e-12;} }
VECTOR ( A && !B ) { POWER = 9973.64 {UNIT = 1.0e-12;} }
VECTOR ( A && B ) { POWER = 1219.60 {UNIT = 1.0e-12;} }
}

In an ALF model, a VECTOR denotes a power-consuming event. In each of the VECTORs of this model, table indices for transition time
(SLEWRATE) and output load (CAPACITANCE) are listed and denote the
conditions under which each particular point in each energy table was
generated. Also note that leakage power has been characterized and
specified by the last four VECTORs in the model.
This cell modeled in Liberty would effectively include the same data,
although with different constructs and one significant difference – the energy
consumed in driving the load capacitance would not be included in the table
of energy values, while in ALF that energy is included.

OLA, by contrast, is a binary compiled library with an API that returns data when called by a power tool. OLA does not specify a file format, as
ALF and Liberty do, but rather defines a set of procedure calls by which
access is gained to power data contained in the compiled library. As such,
OLA specifies no particular modeling philosophy – power can be modeled
using tables or equations and the models can be arbitrarily complex.
However, the API supports three different modes of computing cell power:
All Events Trace (in which a cell’s state is tracked by the OLA library while
activity is tracked on a per pin basis by the power tool), Groups (in which
the OLA library returns power based on a cell’s input states while the power
tool tracks input state), and Pins (in which only pin activities, without
regard to state, are tracked by the power tool).

14.2.4 Different Types of Power Measurements

Power can be measured in a number of different ways, and depending on the issue or application, a different measurement type is required. The major
types of power measurements are Instantaneous, RMS, and Time Averaged.
In many cases, however, it is important to recognize the differences between
power consumption and power dissipation, independent of the measurement
type.

14.2.4.1 Power Dissipation and Power Consumption

Power consumption is defined to be the amount of energy consumed, or withdrawn, from a power supply in unit time. By contrast, power dissipation
is defined to be the amount of energy dissipated – converted into another
form, such as heat – in unit time by a particular circuit element or group of
elements.
Consider the inverter shown in Figure 14.1. Upon receiving a falling edge on the input pin, the p-channel pull-up transistor turns on, resulting in a charging current that deposits enough charge on the load capacitor C_L for it to reach a voltage of V_DD. During this event, the inverter switching from an output low state to an output high state, the amount of energy drawn or consumed from the supply (neglecting the short-circuit current) is equal to C_L·V_DD^2. When the inverter switches again, from an output high to an output low, no current is drawn from the supply – and no energy is consumed – but the current is sunk into ground through the n-channel pull-down.
However, the energy dissipation view is different. During the charging event, the amount of energy dissipated is equal to (1/2)·C_L·V_DD^2. This energy is dissipated – converted into heat – in the resistive elements in the charging path, especially the channel resistance of the transistors. When the capacitor is discharged to ground, the remaining amount of energy, (1/2)·C_L·V_DD^2, is dissipated in the resistances in the pull-down path.
Thus, we see that for a single event, power consumption and power
dissipation are different but are the same when a complementary pair of
charging and discharging events are considered.
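This bookkeeping can be checked with a small numeric example; the load capacitance and supply voltage below are illustrative values, not taken from any particular process.

```python
# Sketch: consumption vs. dissipation over one charge/discharge pair,
# with illustrative values C_L = 10 fF and VDD = 1.8 V.
C_L, VDD = 10e-15, 1.8

consumed_charge   = C_L * VDD**2        # drawn from the supply on the 0->1 output event
dissipated_charge = 0.5 * C_L * VDD**2  # burned in the pull-up path resistance
stored            = consumed_charge - dissipated_charge  # left on the capacitor

consumed_discharge   = 0.0              # the supply delivers nothing on the 1->0 event
dissipated_discharge = stored           # burned in the pull-down path

# Over the complementary pair, consumption and dissipation agree:
assert consumed_charge + consumed_discharge == dissipated_charge + dissipated_discharge
print(consumed_charge)                  # ~3.24e-14 J, i.e. 32.4 fJ per pair
```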
In the following discussions of measurements and tools, all references to
power reflect the power consumption definition.

14.2.4.2 Instantaneous Power

Instantaneous measurements reflect the power being consumed at a particular instant in time. Design automation actually views this instant as
having a duration, known as a time-step. This time-step varies from
Transistor Level Tools 427

application to application, typically from pico-seconds (for cell characterization) to micro-seconds (for full-chip analysis). Smaller time-
steps result in more detailed and more precise waveform measurements but
at the expense of increased run times. Instantaneous values are used to
determine such issues as power rail noise injection as well as power supply
transient response requirements.

14.2.4.3 RMS Power

RMS measurements are used when a single value is needed that describes
relatively long term behavior while at the same time paying special attention
to the peak values. Such is the case for evaluating electromigration current
limits that are dependent on the average value of the current as well as its
peaks. This is especially true for current flow in signal lines, which is bi-
directional and hence would have a very small average current value [7].

14.2.4.4 Time Averaged Power

Time averaged power represents the amount of power consumed over a relatively lengthy period of time (multiple clock cycles, for example) and is
simply the average amount of power consumed during the measurement
period. In terms of basic physics, it is the amount of energy consumed
divided by the measurement period. Time averaged power is used to
calculate junction temperatures and to determine battery life. It is the
measurement type most commonly considered when power is a design issue.
Typical measurement periods are in the microsecond to millisecond range or
even longer.
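The three measurement types can be computed from the same sampled supply-current waveform; the waveform values, time-step, and supply voltage below are illustrative.

```python
import math

# Sketch: instantaneous, RMS, and time-averaged measurements from a
# sampled supply current i[k] (amps) at a fixed time-step dt.
VDD, dt = 1.8, 1e-9
i = [0.0, 0.010, 0.040, 0.010, 0.0, 0.0, 0.030, 0.0]

p_inst = [VDD * ik for ik in i]                        # instantaneous power per step
i_rms  = math.sqrt(sum(ik * ik for ik in i) / len(i))  # RMS current (weights peaks)
energy = sum(p * dt for p in p_inst)                   # joules over the window
p_avg  = energy / (len(i) * dt)                        # time-averaged power

print(max(p_inst), i_rms, p_avg)
```

Note how the peak instantaneous value (rail-noise concerns) and the time-averaged value (temperature and battery-life concerns) differ by more than 3x even on this tiny waveform.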

14.3 TRANSISTOR LEVEL TOOLS

Transistor level tools are generally the most accurate and the most familiar
to IC designers. In fact, accuracy and the well-accepted abstraction are their
primary advantages. Nonetheless, these tools have significant issues in their
applicability to Power Sensitive Design: capacity and run time
characteristics limit their use to moderately sized circuits, or limited amounts
of simulation vectors for larger circuits. However, perhaps the biggest
limitation is that one must have a transistor level design to analyze before
these tools can be used; in other words, a design must be completed in large
part before these tools can be effective.
These tools are utilized primarily in two different use models. The first is
for characterizing circuit elements in order to create timing and power
428 Tools and Methodologies for Power Sensitive Design

models for use at the higher abstraction layers. The second is for verifying,
with the highest levels of accuracy, that the completed transistor level design
meets the targeted power specifications.

14.3.1 Transistor Level Analysis Tools

Transistor level analysis tools provide the bedrock for IC design and most IC
designers understand and rely upon them. There are two different classes of
transistor level analysis tools, generalized circuit simulators and switch level
simulators.
Generalized circuit simulators are used for many different purposes
including timing analysis for digital and analog circuits, transmission line
analysis and circuit characterization. These types of tools are regarded as the
standard for all other analysis approaches since they are designed to model
transistor behavior, at its fundamental levels, as accurately as possible and
are capable of producing any of the three types of power measurements. The
use of a circuit simulator for power analysis is simply one of its many
applications. The primary example of this type of tool is SPICE, of which
there are many variants [3].
Switch level simulators are constructed differently than circuit
simulators. Whereas the latter utilize many detailed equations to model
transistor behavior under a wide variety of conditions, switch level
simulators model each transistor as a non-ideal switch with certain electrical
properties. This modeling simplification results in substantial capacity and
run-time improvements over circuit simulators with only a slight decrease in
accuracy. For most digital circuits, this approach to electrical simulation at
the transistor level is very effective for designs too large to be handled by
SPICE [8]. The leading examples of switch-level power simulators are
NanoSim [9] and its predecessor PowerMill, from Synopsys.

14.3.2 Transistor Level Optimization Tools

A transistor level optimization tool is fundamentally a transistor-sizing tool. The concept is to trade-off power against delay by employing the smallest
transistors possible while still meeting timing requirements. This type of
tool is used only in custom design, where predesigned circuits or cell
libraries are not utilized. Instead, each leaf level circuit is designed from the
ground up using a continuous range of transistor sizes.
A transistor level power optimizer reads in a transistor level netlist of the
circuit to be optimized along with a set of timing constraints. On paths
where a positive timing margin exists, transistor sizes are reduced in order to
reduce the power consumed. This procedure is repeated for a given path
until the timing margin is used up, at which point other paths are considered
for optimization. After all paths have been considered for optimization, a
new transistor level netlist is produced containing resized transistors. Power
reductions in the range of 25% per module are possible compared to un-
optimized circuitry. An example of a transistor-level power optimizer is
AMPS from Synopsys [10].
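The sizing loop described above can be sketched on a toy model in which each path's delay is a sum of k/width terms and its power is proportional to total device width; the delay constant, the 10% shrink step, and the minimum width are illustrative assumptions, not how a production sizer models circuits.

```python
# Sketch: greedy transistor downsizing on one path, under a toy
# delay model delay = sum(k / w) and power proportional to sum(w).

def path_delay(widths, k=2.0):
    return sum(k / w for w in widths)

def downsize(widths, t_max, step=0.9, w_min=1.0):
    """Shrink devices on one path until the timing margin is used up."""
    changed = True
    while changed:
        changed = False
        for n, w in enumerate(widths):
            trial = max(w * step, w_min)
            if trial < w:
                widths[n] = trial
                if path_delay(widths) <= t_max:
                    changed = True          # keep the smaller device
                else:
                    widths[n] = w           # timing violated; revert
    return widths

widths = [4.0, 4.0, 4.0]                    # initial device widths
before = sum(widths)
downsize(widths, t_max=2.5)
print(before, sum(widths), path_delay(widths))
```

A real optimizer would repeat this over all paths in slack order and then re-verify timing on the resized netlist.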

14.3.3 Transistor Level Characterization and Modeling Tools

As described above, transistor-level analysis tools are used to characterize individual circuit primitives in order to produce models for use by tools
operating at higher levels of abstraction. The process of circuit
characterization and model building can be automated by the use of
modeling tools that in turn use an analysis tool as an internal engine [10].
These modelers work by driving and controlling SPICE simulations in
order to create a database of characterization information describing the
power and timing characteristics of the cells being characterized. Required
inputs for a modeler include a transistor level netlist and functional
descriptions for each cell to be characterized along with manufacturing
process definitions and characterization conditions. The modeler uses this
information to create and launch hundreds, or even thousands, of SPICE
simulations for each cell, varying such parameters as input transition time
for each pin, output loading, temperature, and power supply voltage in order
to build up a database of characterization data describing each cell’s
operation under a wide variety of operating conditions. The modeler drives
SPICE by setting the specific operating points for each individual
simulation. It also performs automatic stimulus generation for each
simulation by evaluating the functional definition and then exhaustively
creating all possible combinations. This, of course, limits the automatic
stimulus generation to relatively basic cells such as those found in a standard
cell library.
Once the database has been built up, the data is then written out in
standard formats such as ALF, Liberty, or OLA. The outputs are fully
defined models describing each cell’s functional, timing, and power
behavior. An example of such a model in the ALF format was shown above
in the section on modeling standards. Figure 14.5 below shows the basic
architecture of such a characterization and modeling tool.
It should be noted that while the above description references SPICE as
the internal analysis engine, any transistor level analysis tool could be used
as the calculation engine.
An example of a modeling tool such as this is SiliconSmart-CR from
Silicon Metrics [12].

14.3.4 Derivative Transistor Level Tools

One of the most important types of power tools is the power grid analyzer.
Such a tool analyzes a design’s power delivery network to determine how
much voltage drop occurs at different points on the network [13] and to
evaluate the network’s reliability in terms of sensitivity to electromigration.
A transistor-level power grid analyzer is composed of three key elements: a transistor-level simulator (either a circuit or switch-level simulator, as described above), a matrix solver, and a graphical interface.
The simulator portion is used to compute the circuit’s logical and electrical
response to the applied stimulus and to feed each transistor’s power sinking
or sourcing characteristics to the matrix solver. The matrix solver, in turn,
converts each transistor’s power (or equivalent current) characteristics into
nodal voltages along the power distribution network by calculating the
branch currents in each section of the power grid. These results are then
displayed in the GUI, usually as a color-coded geographic display of the
design.
Gate-level Tools 431

The inputs to a power grid analyzer are the extracted transistor level
netlist along with an extracted RC network for the power and ground rails.
A stimulus is required to excite the circuit, although the form of the
excitation could be either static or dynamic. In the former case, the resulting
analysis would be a DC analysis. For a transient analysis of the power
distribution network, a dynamic stimulus is required.
The outputs of a power grid analyzer are a graphical display illustrating
the gradient of voltages along the power rails, and optionally, a graphical
display of current densities at different points on the rails. This latter display
is used to highlight electromigration violations, as electromigration
sensitivity is a function of current density.
RailMill from Synopsys is an example of such a tool operating at the
transistor level [14].
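A real tool solves the full two-dimensional grid with the matrix solver, but for intuition a one-dimensional rail fed from a single pad reduces to cumulative sums, since each segment simply carries the total of its downstream tap currents; the resistance and current values below are illustrative.

```python
# Sketch: DC IR-drop along a one-dimensional power rail fed from a pad
# at node 0. Each cell taps a static current at its node.
VDD, r_seg = 1.8, 0.05                  # supply voltage, ohms per segment
taps = [0.002, 0.001, 0.004, 0.003]    # amps drawn at nodes 1..4

voltages, v = [], VDD
for k in range(len(taps)):
    branch = sum(taps[k:])             # current carried by segment k
    v -= branch * r_seg                # IR drop across this segment
    voltages.append(v)

print(voltages)                        # monotonically decreasing away from the pad
```

The same branch currents, divided by wire cross-section, are what an electromigration check would compare against current-density limits.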

14.4 GATE-LEVEL TOOLS

Gate-level power tools operate by computing power on a cell-by-cell basis, combining nodal activities obtained from a logical simulation of the gate-
level netlist with cell specific power data from a power library.
Like transistor-level power tools, the gate-level tools are generally well
understood. Compared to the transistor-level tools, gate-level tools trade off
accuracy for significant improvements in run time and capacity. They also
tend to fit into ASIC (application-specific integrated circuit), or cell-based,
design flows much better than transistor level tools since the ASIC flow is
gate-level based. However, gate-level tools are still limited in overall
capacity and are generally used more in a verification role than a design role
since a design must be completed, synthesized, and simulated before
meaningful power results can be obtained. That is, gate-level tools are used
to verify a design’s power characteristics once the design has been
completed or is in the final stages of completion.
Gate-level power verification is usually much faster and more efficient
than transistor-level verification, provided that good gate-level models exist.
Nonetheless, gate-level tools suffer from one of the same issues that afflict
transistor level tools, which is that of Visualization – the forest can’t be seen
for the trees. Tools at both abstractions work at sufficiently low levels that it
is challenging for a designer to understand how to use the output information
in order to improve the design’s power characteristics: what should be done
about a particular logic gate or transistor that appears to consume an
inordinate amount of power?

14.4.1 Gate-Level Analysis Tools

Gate-Level analysis tools compute the power consumed by each logic element in a design [15]. The logic elements, often referred to as cells or
gates, can vary in complexity from simple functions such as an inverter,
NAND gate, or flip-flop, to more complex elements such as PLLs, static
RAMs, or embedded microprocessors. The key concept in gate-level power
analysis is that the design is completely structural in nature – that is, the
design representation is that of a netlist of interconnected cells – and that
each cell is represented by a single power model.
A gate-level power analysis tool reads the structural netlist, a library of
power models, activity information describing the type and amount of signal
activity for each cell, and (optionally) wiring capacitance data. The netlist is
usually in one of two standard HDL formats, either Verilog or VHDL,
although the former is dominant. The power models are read in one of three
industry standard formats (ALF, Liberty, or OLA). A file of signal activities
is produced by a logic simulator and written out in either .vcd or a more
condensed proprietary format.
The gate-level power analyzer loads the library, netlist, and activity data.
It then computes the power consumption of each cell individually. It begins
by analyzing the activity data to determine which events occur for each cell
and how often. Next, it computes the capacitive load on each cell using the
sum of the fanout capacitance and the wiring capacitance. If the wiring
capacitance is not explicitly specified, then the wiring capacitance is
estimated. The event data and the capacitive loading is used to determine
which of the cell’s power data tables are to be accessed and which data
within the table is to be used. Finally, after this has been accomplished for
each cell individually, the total power is calculated by simply summing the
power consumed by all the cells.
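The per-cell flow just described reduces to a few lines when sketched on a toy netlist; the library fields, toggle counts, capacitance numbers, and the fallback wiring-capacitance estimate are all illustrative assumptions, not real library data.

```python
# Sketch of gate-level power analysis: per-cell event energy plus
# load charging energy, summed over the design.
VDD = 1.8
library = {                          # per-cell power model data
    "INV": {"e_int": 8.0e-15, "c_in": 4.0e-15},   # joules/event, farads
    "ND2": {"e_int": 1.2e-14, "c_in": 4.0e-15},
}
netlist = [                          # (instance, cell type, fanout instances)
    ("u1", "INV", ["u2", "u3"]),
    ("u2", "ND2", []),
    ("u3", "ND2", []),
]
toggles  = {"u1": 120, "u2": 45, "u3": 45}  # output events, from logic simulation
wire_cap = {"u1": 6.0e-15}                  # extracted; missing entries estimated

types = {inst: cell for inst, cell, _ in netlist}
total = 0.0
for inst, cell, fanout in netlist:
    c_load = sum(library[types[f]]["c_in"] for f in fanout)
    c_load += wire_cap.get(inst, 2.0e-15 * len(fanout))   # crude estimate
    e_event = library[cell]["e_int"] + 0.5 * c_load * VDD**2
    total += toggles[inst] * e_event        # energy for this cell

t_window = 1e-6                      # simulated interval, seconds
print(total / t_window)              # time-averaged power, watts
```

A production analyzer adds state-dependent leakage, table interpolation on transition time and load, and per-hierarchy reporting, but the summation structure is the same.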
Two different gate level analysis tools are widely used, PowerTheater-
Analyst (and its predecessor Watt Watcher) [16] from Sequence Design and
PrimePower [17] from Synopsys. Both tools perform similar calculations
although differences exist in terms of file and library formats and display
options. Both tools function by post-processing activity data produced
during previously run logic simulations. The accuracy of both tools is
typically within a few percent of SPICE.

14.4.2 Gate-Level Optimization Tools

Gate-level optimization tools are also available. These tools usually take the
form of additional cost-function optimization routines within logic
synthesizers such that power consumption is optimized at the same time as
timing and area during the process of converting RTL descriptions to logic
gates. The primary advantage of these optimizers is that of push-button
automation – the tools automatically search for power saving opportunities
and then implement the changes without violating any timing constraints.
The amount of power reduced varies widely, ranging from a few percent to
more than 25% for the synthesized logic depending on many factors, such as
the end application and the degree to which the original design had been
optimized [18].
Several types of power-saving transformations can be performed at the
gate level and these transformations can generally be categorized into two
groups – those that alter the structure of the netlist and those that maintain it.
The former group includes such transformations as clock gating, operator
isolation, logic restructuring, and pin swapping. Of these, clock gating is
usually the most effective and most widely utilized transformation.
Identifying clock gating opportunities is relatively straightforward, as is
the transformation. The netlist is searched for registers that are configured to
recirculate data when not being loaded with new data. These registers have
a two-to-one multiplexor in front of the D input. These structures are
replaced, as shown in Figure 14.6, with a clock gated register16 that is
clocked only when enabled to load new data into the register. The result is
often a substantial reduction in dynamic power in the clock network.
Note that simulation data is not required to identify clock gating
opportunities. However, depending on the actual activities, gating a clock
can actually cause power consumption to increase – this is the case when the
register being clock-gated switches states often. Nonetheless, the attraction
of gating clocks “blindly” is that the overall design flow is simplified, as
meaningful simulations often take many hours.
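A first-order model shows both outcomes; the per-event energies for the clock pin, the recirculation mux, and the gating cell are invented for illustration.

```python
# Sketch: does gating a register bank's clock save power? Compare the
# average power of the clocking/recirculation logic in both styles.
def clock_power(n_regs, f_clk, enable_duty,
                e_clk_pin=5e-15, e_mux=6e-15, e_gater=20e-15, gated=True):
    """Average power (watts) of the clock tree leaf for one register bank."""
    if gated:
        # clock pins toggle only while enabled, plus the gating cell itself
        return f_clk * (enable_duty * n_regs * e_clk_pin + e_gater)
    # ungated: every edge hits every register, and the mux stays in place
    return f_clk * n_regs * (e_clk_pin + e_mux)

f = 200e6
for n_regs, duty in [(32, 0.05), (1, 0.95)]:
    p_g = clock_power(n_regs, f, duty)
    p_u = clock_power(n_regs, f, duty, gated=False)
    print(n_regs, duty, "saves" if p_g < p_u else "costs")
```

A wide bank with a rarely-asserted enable saves handily, while a single busy register can end up paying for the gating cell, which is the risk of gating "blindly".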
It is often desirable to reduce power without altering the netlist structure,
and two gate-level transformations are often employed to do this – cell re-
sizing and dual-Vt cell swapping. The former technique is employed to
reduce dynamic power and is similar to the sizing technique employed at the
transistor level – cells off of the critical timing path are downsized to the
extent that timing is not violated, using different drive strength cells from the
library. Dual-Vt cell swapping is a similar transformation in that positive
timing slack is traded off for reduced power by replacing cells that are not on

16
In practice, the clock gating logic is usually more complicated than a simple AND gate.
The additional complexity arises out of two requirements, first to ensure that the clock
gating logic will not glitch and secondly, to make the clock gating logic testable. For
example, the latch used in the gated register in Figure 14.6 serves to prevent the AND gate
output from glitching. Additional logic beyond that shown would be required to make this
circuitry testable.
the critical path. In this case, the target is leakage power reduction and the
replacement cells have different timing characteristics by virtue of a
different threshold voltage implant as opposed to a different size or drive
strength. To utilize dual-Vt cell swapping, however, a second library is
required that is composed of cells identical to those in the original library
except for the utilization of a different, higher threshold voltage. Both of
these techniques are most effectively performed once actual routing
parasitics are known, while the logic restructuring techniques are best
performed pre-route or pre-placement.
Figure 14.7 illustrates the effects on the path delay distribution of trading
off slack path delays for reduced power, as achieved by either cell re-sizing
or dual-Vt cell swapping [19].
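The slack-for-leakage trade can be sketched as a greedy swap over a single path; the delay and leakage numbers for the low- and high-Vt variants are invented for illustration, and a real optimizer would work on the whole timing graph rather than one path.

```python
# Sketch: greedy dual-Vt swapping on a toy path. Each cell has a low-Vt
# (fast, leaky) and a high-Vt (slower, low-leakage) variant; swap while
# the path still meets timing.
cells = [  # (name, delay_lvt, leak_lvt, delay_hvt, leak_hvt) in ns / nW
    ("u1", 0.10, 50.0, 0.14, 5.0),
    ("u2", 0.12, 60.0, 0.17, 6.0),
    ("u3", 0.08, 40.0, 0.11, 4.0),
]
t_max = 0.40                          # path timing constraint, ns

delay = sum(c[1] for c in cells)
leak  = sum(c[2] for c in cells)
hvt   = set()
# Try the biggest leakage-saver first, as long as timing still passes.
for name, d_l, l_l, d_h, l_h in sorted(cells, key=lambda c: c[2] - c[4],
                                       reverse=True):
    if delay + (d_h - d_l) <= t_max:
        delay += d_h - d_l
        leak  += l_h - l_l
        hvt.add(name)

print(sorted(hvt), round(delay, 3), leak)
```

Here two of the three cells can be swapped before the slack is exhausted, cutting the toy path's leakage by roughly two thirds without violating the constraint.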
Several gate-level power optimizers are commercially available,
PowerCompiler from Synopsys [20], the Low-power Synthesis Option for
BuildGates from Cadence [21], and Physical Studio from Sequence [22].
The first two work as add-on optimization routines to standard synthesizers
so that power consumption is optimized at the same time as timing and area
when the synthesizers are converting RTL descriptions to logic gates. The
latter functions as a post-place-and-route power optimizer utilizing extracted
wiring parasitics.

14.4.3 Gate-Level Modeling Tools

The role of a gate-level modeler is to create models, especially for complex cells such as memories or functional blocks, to be used by gate and higher-
level power analysis and optimization tools. The best example of such a
modeler is a memory compiler.
Memory compilers are used to create layouts and models for a variety of
different sizes and types of memories, such as RAMs, ROMs, and CAMs.
For a given memory type and size, the compiler generates a number of
models including functional, timing, and power models along with the actual
memory layout. The power data embedded in the models is based upon
characterization data stored within the compiler; this characterization data
is generated beforehand by a number of transistor-level simulations of
representative memory sizes and organization options.
The detailed content of the power models generated by compilers varies
widely but generally includes data for the major power-consuming events
such as Read and Write operations. More detailed models will include
additional control signal state dependencies, input and output pin switching
effects, and static-power data.
Memory compilers that generate power models are available from Virage
Logic [23] as well as most ASIC suppliers.
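As a rough illustration of how such characterization data might be consumed, the sketch below interpolates energy-per-access between characterized memory depths. The table values and the linear-interpolation scheme are assumptions for illustration, not any vendor's actual model format.

```python
# Hypothetical compiler power model: energy per Read/Write access is looked
# up, and linearly interpolated, from characterization points that were
# generated by transistor-level simulation. All numbers are invented.

CHAR_DATA = {  # memory depth in words -> energy per access in pJ
    'read':  {256: 8.0, 1024: 14.0, 4096: 26.0},
    'write': {256: 10.0, 1024: 17.0, 4096: 31.0},
}

def energy_per_access(op, depth):
    pts = sorted(CHAR_DATA[op].items())
    for (d0, e0), (d1, e1) in zip(pts, pts[1:]):
        if d0 <= depth <= d1:
            # linear interpolation between neighboring characterized depths
            return e0 + (e1 - e0) * (depth - d0) / (d1 - d0)
    raise ValueError('depth outside characterized range')
```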

14.4.4 Derivative Gate-Level Tools

Three types of derivative gate-level tools are noteworthy: power grid
analysis (similar to that mentioned above at the transistor level), power grid
planning, and timing analysis.
Gate-level power grid analysis tools are similar to the transistor-level
power grid analysis tools described above except that instead of analyzing a
transistor-level netlist obtained from layout, an extracted cell-level netlist is
analyzed. Depending upon implementation, other differences may exist as
well. For example, the transistor-level tools, due to their fundamental
architecture, calculate power along with power grid voltages and currents
concurrently with the functional simulation. Some gate level power grid
analysis tools compute instantaneous voltage values while others compute
only effective DC values.

The most prevalent power grid analysis tool architecture relies upon the
computation of an average power value for each cell, based on a gate level
simulation as might be performed by a gate-level power analyzer. These
power values are then converted to DC current sources and are attached to
the power grid at the appropriate points per cell. This information, along
with the extracted resistive network, is fed to a matrix solver to compute DC
branch currents and node voltages. This type of analysis is referred to as a
static, or DC, analysis, since the currents used in the voltage analysis are
assumed to be time invariant in order to simplify the tool architecture and
analysis type. Mars-Rail from Avanti [24] and VoltageStorm-SoC [25] are
examples of a gate-level static power grid analysis tool, although the latter
also incorporates some transistor-level analysis capabilities.
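The static analysis described above can be miniaturized into a few lines: average cell powers become DC current sinks on a resistive rail, and the nodal equations are relaxed to a solution. This toy solver (a single rail chain, Gauss-Seidel relaxation instead of a sparse matrix package, invented values) is for intuition only.

```python
# Toy static (DC) power grid analysis: average cell power P becomes a DC
# current sink I = P / Vdd, attached to nodes of a resistive rail fed from
# one pad; node voltages are found by Gauss-Seidel relaxation.

def static_ir_drop(vdd, r_seg, cell_power, iters=5000):
    """Rail modeled as pad -> node0 -> node1 -> ... with r_seg ohms per
    segment; cell_power lists the average power drawn at each node."""
    n = len(cell_power)
    i_sink = [p / vdd for p in cell_power]
    v = [vdd] * n
    for _ in range(iters):
        for k in range(n):
            left = vdd if k == 0 else v[k - 1]    # pad or previous node
            g = 1.0 / r_seg
            i_in = left / r_seg
            if k + 1 < n:                         # interior nodes see a right neighbor
                g += 1.0 / r_seg
                i_in += v[k + 1] / r_seg
            v[k] = (i_in - i_sink[k]) / g         # nodal (KCL) update
    return v
```

For a 1.8 V rail with 0.1-ohm segments and two 0.9 W cells, the far node settles at 1.65 V, an IR drop of 150 mV.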
Power grid planning tools assist in the creation of the power grid before,
or during, placement and routing. The idea is to design and customize the
power grid to the specifics of the design’s power consumption characteristics
and floor plan. Power grid planners require as inputs an estimate of the
design’s power consumption broken down to some level of detail, a floor
plan for the design, and information about the resistive characteristics of the
manufacturing process routing layers and vias. The tool then produces a
route of the power and ground networks with estimates of the voltage
variations along the network. An example of this type of tool is Iota’s
PowerPlanner [26].
The third type of derivative gate-level power tool is a power-sensitive
delay calculator. Conventional delay calculators compute timing delays
based on the well-known factors of input transition time and output loading,
assuming no voltage variation along the power rails. However, voltage
variations do occur, in both space and time and affect signal timings. More
recent delay calculators, such as that found in ShowTime from Sequence
Design, incorporate the effects of localized power supply variations when
computing delays [27].
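A first-order feel for why droop matters can be had from the alpha-power delay model; the parameter values below are assumptions, and real delay calculators use characterized cell tables rather than this closed form.

```python
# Alpha-power-law sketch (illustrative, not a tool's delay model): gate delay
# grows as V / (V - Vt)^alpha, so a local supply droop stretches the delay.

def gate_delay(v_local, v_t=0.4, alpha=1.3, k=1.0):
    """v_local: drooped rail voltage at the cell; v_t, alpha, k are assumed."""
    return k * v_local / (v_local - v_t) ** alpha

nominal = gate_delay(1.8)          # full rail
drooped = gate_delay(1.8 * 0.95)   # a 5% local IR drop slows the cell down
```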

14.5 REGISTER TRANSFER-LEVEL TOOLS

The register transfer level, or RTL, is the abstraction level at which much of
IC functional design is performed today. As such, analysis and optimization
at this level is especially important.
Two other considerations account for the significance of tools at this
level. The first is that this is the level at which designers think about
architecting their logic, and it is also at this point that many opportunities for
major power savings exist. The second is that the design-analyze-modify loop is

much faster and convenient when compared to synthesizing the logic in


order to perform a gate-level power analysis.
From a basic tool perspective, RTL power tools fundamentally run much
faster and require less disk and memory space than equivalently functional
gate-level tools because the database is much smaller. In terms of speed,
RTL tools are about an order of magnitude faster than gate-level tools, which
in turn are about an order of magnitude faster than transistor-level tools.
These advantages come at the cost of accuracy – since fewer details are
known at this level compared to the gate or transistor levels, the accuracy of
the results is not as high as at the lower levels.
The usage models also differ from those of the gate and transistor-level
tools. RTL power tools are primarily used as design tools, as aids in
the design creation process. By contrast, the gate and transistor level tools
are used primarily as verification tools, since by the time that a chip-level
design reaches that abstraction, most of the creative part of the design
process is completed. Thus the gate and transistor-level tools are used to
verify that the design is on target to meet the power-consumption target.
Two obvious exceptions to this generalization are, of course, custom circuit
design, in which the transistor-level tools are used in the design of the
custom circuits, and characterization, in which the transistor-level tools are
used to build gate-level models.

14.5.1 RTL Analysis Tools

RTL power analysis tools estimate the power consumption characteristics of
a design described at the register transfer level in hardware description
languages such as Verilog or VHDL. The estimation is performed prior to
synthesis and the results are directly linked to the RTL source code to
facilitate an understanding of which portions of the code result in how much
power consumption.
RTL power analyzers operate by reading the RTL code and performing a
micro-architectural synthesis, known as inferencing, which converts the RTL
code into a netlist of parameterized micro-architectural constructs such as
adders, decoders, registers, and memories. Power is estimated utilizing a
divide and conquer approach wherein different estimators are used for each
of the different types of constructs inferred from the RTL. For each of these
constructs the RTL analyzer utilizes high-level power models that, when
supplied with activity data, will compute power consumption on an
instance-specific basis17. Elements not yet present in the design, such as clock

17 An instance in this case is an operator, or several lines of code implementing an inferred
function, such as an adder or multiplier, decoder or multiplexor.

distribution, test logic, or wiring capacitances, are all estimated according to
specifications of future changes to the design in the subsequent design flow
steps. For example, clock power is estimated by building a symbolic clock
tree using the same rules that would later be used to build the actual clock
tree during physical design.
As with the gate-level power analyzers, both power libraries and activity
data are required to compute power. The power libraries used for gate-level
analysis can also be used at this level, but the activity data is captured from
RTL simulations. Also similar to the gate-level tools, both average power
and time-varying, or instantaneous, power can be estimated. However,
unlike the gate-level analyzers, which report results referencing individual
gates, the power estimates produced by an RTL analyzer refer to lines of
RTL code.
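In essence, each inferred construct contributes an activity-weighted C*V^2*f term. A deliberately simplified sketch, with invented macro-model capacitances and a hypothetical function name:

```python
# Simplified per-construct RTL power estimate (illustration only): each
# inferred instance has a macro-model effective capacitance, scaled by the
# toggle activity captured from RTL simulation. All values are invented.

MACRO_MODELS = {'add16': 2.0, 'reg16': 0.8, 'mux16': 0.4}   # pF per activation

def rtl_power_uw(instances, vdd, f_mhz):
    """instances: (construct, average toggles per cycle) pairs. With C in
    pF, V in volts, and f in MHz, the sum comes out in microwatts."""
    return sum(act * MACRO_MODELS[kind] * vdd ** 2 * f_mhz
               for kind, act in instances)
```

Because each term is tied to an inferred instance, the per-line reporting described above falls out naturally: every contribution can be traced back to the RTL code that produced the instance.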

The accuracy of RTL estimates is not as high as that achieved by gate-level
analyzers but nonetheless can be within 20% of actual measurements.
Some of the factors that contribute to the difference in accuracy between the
gate-level and the RTL estimates are the use of wire-load models for wiring

capacitance estimation and the uncertainty as to which logic
optimizations will be employed later during the synthesis process.
Figure 14.9 illustrates the architecture of an RTL analysis tool such as
PowerTheater-Analyst from Sequence Design [16]. As described above, it
can also be used as a gate-level power analyzer, making it well-suited for use
in a methodology of successively refining estimates from the early stages of
a design project to a post place-and-route verification.

14.5.2 RTL Optimization Tools

An RTL power optimization tool is one that reads, as input, an RTL
description and produces, as output, a power-reduced RTL description. A
relatively large universe of optimizations is possible at this level, some
examples of which are memory restructuring, operator isolation, clock
gating, datapath reordering, and operator reduction.
An optimizer working at the register transfer level works by searching
the RTL design database for particular constructs known to be optimizable.
An illustrative example is that of isolating a datapath operation. Consider a
multiplier whose output feeds a multiplexor input, implying that the
multiplier’s results are not used all the time. Thus, when the multiplier’s
results are not being used, the operation and the power consumed by that
operation are being wasted. In this case, power can be reduced by placing a
latch, controlled by the multiplexor select line, in front of the multiplier’s
inputs. This will prevent the multiplier from computing new data unless the
downstream multiplexor intends to let the results pass through.
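The operand-isolation trade-off above can be captured in a back-of-the-envelope model (my own illustration, not a tool's cost function); the power figures are invented.

```python
# Back-of-the-envelope model of the operand-isolation example: the isolating
# latch keeps the multiplier idle except in cycles where the downstream mux
# actually selects its result, at the cost of the latch's own power.

def isolation_net_savings(p_mult, p_latch, select_prob):
    """p_mult: multiplier power when computing every cycle; p_latch: power
    added by the isolation latches; select_prob: fraction of cycles in
    which the multiplier result is selected downstream."""
    before = p_mult
    after = p_mult * select_prob + p_latch
    return before - after          # negative means isolation costs power
```

This is exactly why the optimizer needs activity data: with the result selected rarely the savings are large, but with select_prob near 1 the latch overhead makes the transformation a net loss.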
For most potential transformations, the optimizer must utilize activity
data to calculate the actual power savings, as some changes may cause
power consumption to increase depending upon the details of the circuit’s
activity. Clock gating changes are one example of a transformation that can
cause power to increase under certain circumstances. Another is memory
system restructuring, since many power-saving memory transformations are
dependent upon the exploitation of certain access patterns in order to reduce
power.
PowerTheater-Designer from Sequence Design [16] reads an RTL
description and produces as output power optimized RTL.

14.6 BEHAVIOR-LEVEL TOOLS

Similar to their brethren at the RT-level, behavior-level tools are used during
the design creation process. Modules or entire designs can be described at
this level. Two different motivations exist for describing designs

behaviorally: simulation and synthesis. In the former case, a model is
created for the purpose of simulation in order to prove the overall intended
functionality. In the latter case, in addition to enabling functional
simulation, the model also serves as the source design for subsequent
synthesis to RTL and gates. In addition, since power reduction opportunities
are largest at the highest abstraction levels, the ability to evaluate power
reduction tradeoffs at this level can be especially effective.
One of the significant differences between designs described at the RT
and behavioral levels is that of clock cycle functionality – the number and
type of clocks that need to be explicitly specified. Designs described at the
RT-level assume that this is fixed, and RTL synthesizers follow through on
that assumption in that they will not attempt to move (or retime) the
functionality across a clock cycle boundary. By comparison, behavioral
descriptions make no such assumption and behavioral synthesizers will
allocate the functionality across multiple clock cycles. This feature of
behavioral synthesis can have a significant impact on both power analysis
and optimization tools at the behavior-level.

14.6.1 Behavior-Level Analysis Tools

Behavior-level analysis tools read designs written behaviorally in an HDL or
a common high-level programming language such as C or C++. These tools
are used to assist with algorithm and architecture design and to help assess
the impact of tradeoffs such as different data encodings, memory mappings,
and hardware/software partitioning.
Like their counterparts at the RT and gate-levels, activity and power data
are required in order to produce a power estimate. And, much like RTL
power analysis tools, a mapping between language constructs and hardware
objects must be made in order to enable a power estimate. However, this is a
much more challenging task than at the RT-level because schedules are not
yet fixed and many different hardware implementations can result from a
given behavioral description, each of which may have dramatically different
power characteristics.
One approach to the issue of many different potential implementations is
to analyze the various combinations of scheduling, allocation, and binding
and then report the results as a spread of minimum to maximum power
consumption, along with the resulting performance and area estimates.
Orinoco, from Offis, performs this type of analysis using a macro-based
approach and produces as output a set of constraints with which to drive a
behavioral synthesis [28].

14.6.2 Behavior-Level Optimization Tools

While the wide variance in power consumption characteristics between the
different potential implementations of a behavioral description presents
challenges for an analysis tool, these challenges are actually opportunities
for an optimization tool.
Behavior-level optimization involves elaborating the target design to
create many different versions automatically, analyzing each one for power,
and then selecting the optimal design to write out. Each of the elaborated
versions will be mapped onto known, or pre-characterized, objects for which
power data or power models are available. A significant issue here is the
number and type of objects for which power models exist. On the one hand,
more objects are desirable since this effectively expands the search space.
On the other hand, additional objects enlarge the modeling and
characterization effort, which must be repeated for each target technology.
For example, consider the number and type of adders that must be available
for the optimizer to consider: various bit widths for ripple-carry, carry-look-
ahead, and carry-skip adders. If only modulo-4 bit versions are considered,
96 different adder models must be available to support 128 bit datapaths.
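The 96-model count quoted above follows directly from the enumeration:

```python
# The adder-model count above: three adder families, one model per
# modulo-4 bit width from 4 up to 128 bits.
families = ['ripple-carry', 'carry-lookahead', 'carry-skip']
widths = list(range(4, 129, 4))             # 4, 8, ..., 128 -> 32 widths
model_count = len(families) * len(widths)   # 3 * 32 = 96
```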
It is for this reason that pre-characterized high-level objects are generally
difficult to employ for optimization. A different approach, which avoids this
issue, is to ignore power models and calculations altogether and limit the
transformation search space to those transformations that are known a priori,
in most cases at least, to reduce power without requiring an explicit power
calculation step. Examples here are rescheduling control and data flow to
enable more clock gating or to reduce the number of memory accesses.
Despite these issues, behavior-level optimization tools do exist, an
example of which is Atomium, from IMEC [29]. Atomium targets memory-
system optimization by minimizing overall memory storage requirements as
well as the number of memory accesses. Atomium reads a C-language
description and produces a transformed description, also in C.

14.7 SYSTEM-LEVEL TOOLS

The analysis and optimization of power at the system level involves the
global consideration of voltages and frequencies. For example, system-level
concerns about battery types and availability often dictate the voltages at
which portable devices must operate. Similarly, thermal and mechanical
issues in laptops often limit microprocessor operating frequencies to less
than their maximum operating capability. Accordingly, the proper analysis

of these and related concerns often has the highest impact on the overall
power characteristics and success of the target design.
Unfortunately, little in the way of design automation is available for
addressing these concerns. In fact, the most prevalent software tool used in
this arena is the spreadsheet.
Spreadsheets are fast, flexible, and generally well understood. In fact,
spreadsheets were adopted for chip-level power estimation prior to the
emergence of dedicated power analysis tools [30]. The capabilities that
made spreadsheets applicable to chip-level power analysis are also
applicable to system-level analysis – ease of use, modeling flexibility, and
customizability. Unfortunately, the disadvantages are also applicable – error
prone nature, wide accuracy variance, and manual interface.
Nonetheless, spreadsheets such as Microsoft’s Excel are used to model
entire systems. System components and sub-blocks are modeled with
customized equations using parameters such as supply voltage, operating
frequency, and effective switched capacitances. Technology data may or
may not be explicitly parameterized, but if it is, it is typically derived from
data-book information published by the component and technology vendors.
Spreadsheets are most often utilized at the very earliest stages of system
design when rough cuts at power budgets and targets are needed and before
much detailed work has begun. As the design progresses and descends
through the various abstraction levels, and as more capable and automated
tools become applicable, spreadsheet usage typically wanes in the face of
more reliable and more automated calculations.
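A spreadsheet power budget of the kind described reduces to a few sum-of-products rows. The sketch below recasts one in Python; the component figures and battery capacity are invented placeholders.

```python
# Spreadsheet-style system-level power budget: per-component supply voltage
# and average current give block power; the total sizes the battery life.
# All figures are invented placeholders.

components = {                    # name: (supply voltage in V, avg current in mA)
    'processor': (1.8, 120.0),
    'dsp':       (1.8, 60.0),
    'display':   (3.3, 40.0),
    'radio':     (3.3, 25.0),
}

total_mw = sum(v * i for v, i in components.values())   # P = V * I per block
battery_mwh = 3.7 * 1000.0        # a 1000 mAh cell at 3.7 V nominal
hours = battery_mwh / total_mw    # crude battery-life estimate
```

Exactly this kind of row-and-formula model is what the spreadsheet provides: fast to build and easy to customize, but manual to keep in sync with the evolving design.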

14.8 A POWER-SENSITIVE DESIGN METHODOLOGY

To fully manage and optimize power consumption, a design methodology
must address power consumption and power-related concerns at each stage
of the design process and at each level of design abstraction. Target power
specifications must be developed at the very beginning, and the design
should be checked against these specifications at each abstraction level and
in each design review.
The previous sections presented overviews of the various tool types at
each abstraction layer for dealing with power. However, to utilize these
tools most effectively, a methodology for design that utilizes these tools is
required.

14.8.1 Power-Sensitive Design

The conventional design view of power is that there are two primary design
activities, analysis and minimization, the latter often known as Low Power
Design. An example of this would be the design of the StrongArm
microprocessor, whose design target was low power (less than a watt) with
good performance [31].
But low power processors are not the only designs in which power is a
concern. Consider the 21264 Alpha microprocessor, which consumes 72 W
while being clocked at 600 MHz [32]. Designers of this device had to
consider many power-related, or power-sensitive, issues such as package
thermal characteristics, temperature calculations and thermal gradients,
power bus sizing, di/dt transient currents, and noise margin analysis [33] in
addition to the various power-savings techniques that prevented this machine
from consuming even more power.
Rolled together, the consideration of all these issues, power
minimization, and the analysis and management of those parameters affected
by power, constitutes Power Sensitive Design.

14.8.2 Feedback vs. Feed Forward

Much of digital design today is accomplished utilizing a top-down or
modified top-down design flow in which top refers to the higher levels of
design abstraction, such as the system, behavior, and RT-levels, and time
flows downward towards the lower levels of design abstraction such as the
gate and transistor levels. In this case, flow refers to the sequence of tasks;
however, the flow of detailed design information is somewhat less clear.
In conventional practice, detailed design information tends to follow a
Feedback flow wherein information regarding particular power
characteristics does not become available until the design has progressed to
the lower abstraction levels. A Feedback design flow features a relatively
lengthy feedback loop from the analysis results obtained at the gate or
transistor level to the design tasks at the RT-level and above. Thus
information about the design’s power characteristics is not obtained until
quite late in the design process. Once this information is available, it is fed
back to the higher abstraction levels to be used in determining what to do to
deal with the power issues of concern. The farther the lower-level power
analysis results are in excess of the target specification, the higher in the
abstraction levels one must return in order to change the design to try to
meet that specification.
In the Feed Forward approach, illustrated in Figure 14.10, these lengthy,
cross-synthesis, cross-abstraction feedback loops are replaced with more

efficient, abstraction-specific loops. Thus the design that is fed forward to
the lower abstraction levels is much less likely to be fed back for reworking,
and the analysis performed at the lower levels becomes less of a design
effort and more of a verification effort. The key concept is to identify, as
early as possible, the design parameters and trade-offs that are required to
meet the project’s power specs. In this way, it is ensured that the design
being fed forward is fundamentally capable of achieving the power targets.

The Feed Forward flow is enabled by effective high-level – RT and
above – analysis tools that can accurately predict power characteristics.
These early estimation capabilities enable the designer to confidently assess
design tradeoffs without having to resort to detailed design efforts or low-
level implementations in order to assess performance against the target
power specification. Compared with the traditional top-down methods, the
key difference and advantage is this early prediction capability.

For example, the design of a low-power appliance might begin with a
spreadsheet analysis to consider trade-offs between parameters such as
which embedded processor, operating frequency, and supply voltage(s) to
use. Once the estimates indicate that the system being considered could
meet the target power spec, given constraints such as cost, battery
technology, and available IP, then the development proceeds to more
detailed design at the behavioral or RT-levels. Here the effects of different
algorithms and different hardware architectures can be explored, and the
estimates previously performed with the spreadsheet are re-checked and
refined using high-level estimators along with the more detailed information
that resulted from the most recent design efforts. The development does not
progress to the lower abstraction levels until the latest estimate shows that
the power specs can be met.
Proceeding in parallel with, or sometimes ahead of, the architecture
development is the design of the library macro functions and custom
elements such as datapath cells. These are used in the subsequent
implementation phase in which the RTL design is converted into a gate-level
netlist. At this point, appropriate optimizations are performed again and
power is re-estimated with more detailed information such as floor-planned
wiring capacitances. The power grid is planned and laid out using this
power data.
After the design has been placed and routed, another set of power
optimizations can be performed, this time using the extracted wiring
capacitances while trading-off slack timing for reduced dynamic or leakage
power. Lastly, as part of the final tape-out verification and signoff, power is
calculated and used to compute and validate key design parameters such as
total power consumption in active and standby modes, junction
temperatures, power supply droop, noise margins, and signal delays.
Thus, power is analyzed multiple times at each abstraction layer for a
design following the Feed Forward approach. Each analysis is successively
refined from the previous analysis by using information fed forward from
prior design decisions along with new details produced by the most recent
design activities. This approach encourages design efforts to be spent up
front, at the higher abstraction levels, where design efforts are most effective
in terms of minimizing and controlling power [34]. Also, because power
sensitive issues are tracked from the beginning to the end, late surprises are
minimized.

14.9 A VIEW TO THE FUTURE

As designs become larger, and both functionally and electrically more
complex, the process of design and the role of design automation must adapt.
One adaptation, already underway, is the move to perform more and more of
the design effort at the higher abstraction levels. New approaches to
minimizing power will emerge, such as power-optimizing software
compilers. These approaches will require the development of detailed, yet
computationally efficient, power models for complex objects such as
microprocessors, micro-controllers, and network interfaces.
Early predictions of thermal concerns and leakage effects will also
become more prevalent. Additionally, much more attention will be spent on
verifying and optimizing power characteristics at the RT-level since designs
can be analyzed at this stage with adequate accuracy without requiring the
extraordinary amounts of computational resources needed by lower level
tools. Yet, despite the resources required for conventional back-end tasks, a
new tape-out signoff requirement will emerge – Power Signoff.
Power Signoff will entail the full verification of power and its effects at
the physical level. This will include conventional measures such as power
consumption and power grid voltage drop along with effects previously
considered second order for many designs, such as thermal gradients, noise
injection, device-package interactions, and power-sensitive delay
calculation. Even though such a verification will entail lengthy run-times
and huge data-sets, it will be performed infrequently – only during the
signoff process – thus mitigating the cost in terms of time.
Additionally, vectorless tool operation – the ability to perform analyses
or optimizations without requiring simulation generated data – will become
more prevalent for power tools at all levels, particularly those operating on
entire chips since it will become impractical to perform meaningful
simulations in acceptable time frames.

14.10 SUMMARY

Power consumption and power-related issues have become a first-order
concern for most designs and loom as fundamental barriers for many others.
And while the primary method used to date for reducing power has been
supply voltage reduction, this technique begins to lose its effectiveness as
voltages drop to the 1 volt range and further reductions in the supply voltage
begin to create more problems than are solved. In this environment, the
process of design, and the automation tools required to support that process,
become the critical success factors.

In particular, several key elements emerge as enablers for an effective
Power Sensitive Design Methodology. The first is the availability of
accurate, comprehensive power models. The second is the existence of fast,
easy to use high level estimation and design exploration tools for analysis
and optimization during the design creation process, while the third is the
existence of highly accurate, high capacity verification tools for tape-out
power verification.
And, as befitting a first-order concern, successfully managing the various
power-related design issues will require that power be addressed at all
phases and in all aspects of design, especially during the earliest design and
planning activities. Advanced power tools will play central roles in these
efforts.

REFERENCES
[1] D. Singh, et al., “Power conscious CAD tools and methodologies: A Perspective,”
Proceedings of the IEEE, Apr. 1995, pp. 570-594.
[2] V. De, et al., “Techniques for leakage power reduction,” in A. Chandrakasan, et al.,
editors, Design of High Performance Microprocessor Circuits, IEEE Press, New York,
Chapter 3, 2001.
[3] Star-HSpice Data Sheet, Avanti Corporation, Fremont, CA, 2002.
[4] “Liberty user guide,” Synopsys, Inc. V1999.
[5] “Advanced library format for ASIC technology, cells, and blocks,” Accelera, V2.0,
December 2000.
[6] IEEE 1481 Standard for Delay & Power Calculation Language Reference Manual.
[7] J. Clement, “Electromigration reliability,” in A. Chandrakasan, et al., editors, Design of
High Performance Microprocessor Circuits, IEEE Press, New York, Chapter 20, 2001.
[8] A. Deng, “Power analysis for CMOS/BiCMOS circuits,” in Proceedings of the 1994
International Workshop on Low Power Design, Apr. 1994.
[9] NanoSim Data Sheet, Synopsys, Inc., Mountain View, CA, 2001.
[10] AMPS Data Sheet, Synopsys, Inc., Mountain View, CA, 1999.
[11] S. M. Kang, “Accurate simulation of power dissipation in VLSI Circuits,” in IEEE
Journal of Solid State Circuits, vol. 21, Oct. 1986, pp. 889-891.
[12] SiliconSmart CR Data Sheet, Silicon Metrics Corporation, Austin, TX, 2000.
[13] S. Panda, et al., “Model and analysis for combined package and on-chip power grid
simulation,” in Proceedings of the 2000 International Symposium on Low Power
Electronics and Design, Jul. 2000.
[14] RailMill Data Sheet, Synopsys, Inc., Mountain View, CA, 2000.
[15] B. George, et al., “Power analysis and characterization for semi-custom design,” in
Proceedings of the 1994 International Workshop on Low Power Design, Apr. 1994.
[16] PowerTheater Reference Manual, Sequence Design, Inc., Santa Clara, CA, 2001.
[17] PrimePower Data Sheet, Synopsys, Inc., Mountain View, CA, 2002.
[18] S. Iman and M. Pedram, “POSE: power optimization and synthesis environment,” in
Proceedings of the 33rd Design Automation Conference, Jun. 1996.

[19] S. Narendra, et al., “Scaling of stack effect and its application for leakage reduction,”
Proceedings of the 2001 International Symposium on Low Power Electronics and
Design, Aug. 2001.
[20] PowerCompiler Data Sheet, Synopsys, Inc., Mountain View, CA, 2001.
[21] Low-Power Synthesis Option for BuildGates and PKS Data Sheet, Cadence, Inc., San
Jose, CA, 2001.
[22] Physical Studio Reference Manual, Sequence Design, Inc., Santa Clara, CA, 2002.
[23] Synchronous SRAM Memory Core Family, TSMC 0.1 Process Datasheet, Virage
Logic, Fremont, CA, 2000.
[24] Mars-Rail Data Sheet, Avanti Corporation, Fremont, CA, 2002.
[25] VoltageStorm-SoC Data Sheet, Simplex Solutions, Inc, Sunnyvale, CA, 2001.
[26] PowerPlanner Data Sheet, Iota Technology, Inc., San Jose, CA. 2001.
[27] Showtime Reference Manual, Sequence Design, Inc., Santa Clara, CA, 2002.
[28] Orinoco Data Sheet, Offis Systems and Consulting, GMBH, 2001.
[29] F. Catthoor, “Unified low-power design flow for data-dominated multi-media and
telecom applications,” Kluwer Academic Publishers, Boston, 2000.
[30] J. Frenkil, “Power dissipation of CMOS ASICs,” in Proceedings of the IEEE
International ASIC Conference, Sep. 1991.
[31] D. Dobberpuhl, “The design of a high performance low power microprocessor,” in
Proceedings of the 1996 International Symposium on Low Power Electronics and
Design, Aug. 1996.
[32] B. Gieseke, et al., “A 600MHz superscalar RISC Microprocessor with out-of-order
execution,” in ISSCC Digest of Technical Papers, Feb. 1997, pp. 176-177.
[33] P. Gronowski, et al., “High performance microprocessor design,” in IEEE Journal of
Solid State Circuits, vol. 33, May 1998, pp. 676-686.
[34] P. Landman, et al., “An integrated CAD Environment for low-power design,” in IEEE
Design and Test of Computers, vol. 13, Summer 1996, pp. 72-82.
This page intentionally left blank
Chapter 15

Reconfigurable Processors — The Road to Flexible Power-Aware Computing

J. Rabaey, A. Abnous, H. Zhang, M. Wan, V. George, V. Prabhu
University of California at Berkeley

Abstract: Energy considerations are at the heart of important paradigm shifts in next-generation designs, especially in the systems-on-a-chip era. With multimedia and
communication functions becoming more and more prominent, coming up
with low-power solutions for these signal-processing applications is a clear
must. Harvard-style architectures, as used in traditional signal processors,
incur a significant overhead in power dissipation. It is therefore worthwhile to
explore novel and different architectures and to quantify their impact on
energy efficiency. Recently, reconfigurable programmable engines have
received a lot of attention. In this chapter, the opportunity for substantial
power reduction by using hybrid reconfigurable processors will be explored.
With the aid of an extensive example, it will be demonstrated that power
reductions of orders of magnitude are attainable.

Key words: Power-aware computing, systems-on-a-chip, platform-based design, signal processing, reconfigurable processors, agile computing.

15.1 INTRODUCTION

Systems-on-a-chip are a reality today, combining a wide range of complex functionalities on a single die [1]. Integrated circuits that merge core
processors, DSPs, embedded memory, and custom modules have been
reported by a number of companies. It is by no means a wild projection to
assume that a future generation design will combine all the functionality of a
mobile multimedia terminal, including the traditional computational
functions and operating system, the extensions for full multimedia support
including graphics, video and high quality audio, and wired and wireless
communication support. In short, such a design will mix a wide variety of
architecture and circuit styles, ranging from RF and analog to high-
performance and custom digital (Figure 15.1).
Such an integration complexity may seem daunting to a designer, and
may make all our nightmares regarding performance, timing and power
come true. On the other hand, the high level of integration combined with its
myriad of design choices might be a blessing as well and can effectively help
us to address some of the most compelling energy or power-dissipation
problems facing us today. Even more, it might enable the introduction of
novel power-reducing circuit techniques that are harder to exploit in
traditional architectures.
The rest of the chapter is structured as follows. First, we discuss the
concept of platform-based design approach to systems-on-a-chip (SOC).
Next, the opportunities for power (energy) reduction at the architectural level
are discussed. This is followed by an analysis of how reconfigurable
architectures can exploit some of these opportunities. We quantify the
resulting power (energy) reductions with the aid of a real design example.
The chapter is concluded with a presentation of some of the circuit
techniques that are enabled by the reconfigurable approach.

15.2 PLATFORM-BASED DESIGN

The overall goal of electronic system design is to balance production costs with development time and cost in view of performance, functionality
and product-volume constraints. Production cost depends mainly on the
hardware components of the product, and minimizing it requires a balance
between competing criteria. If we think of an integrated circuit
implementation, then the size of the chip is an important factor in
determining production cost. Minimizing the size of the chip implies
tailoring the hardware architecture to the functionality of the product.
However, the cost of a state-of-the-art fabrication facility continues to rise: it
is estimated that a new high-volume manufacturing plant costs
approximately $2-3B today. NRE (Non-Recoverable Engineering) costs
associated with the design and the tooling of complex chips are growing
rapidly.
Creation of an economically feasible SoC design flow requires a
structured, top-down methodology that theoretically limits the space of
exploration, yet in doing so achieves superior results in the fixed time
constraints of the design. In recent years, the use of platforms at all of the
key articulation points in the SoC design flow has been advocated [2]. Each
platform represents a layer in the design flow for which the underlying,
subsequent design-flow steps are abstracted. By carefully defining the
platform layers and developing new representations and associated
transitions from one platform to the next, we believe that an economically
feasible “single-pass” SoC flow can be realized.
The platform concept itself is not entirely new, and has been successfully
used for years. However, the interpretation of what a platform is has been, to
say the least, confusing. In the IC domain, a platform is considered a
“flexible” integrated circuit where customization for a particular application
is achieved by “programming” one or more of the components of the chip.
Programming may imply “metal customization” (Gate arrays), electrical
modification (FPGA personalization) or software to run on a microprocessor
or a DSP. These flexible integrated circuits can be defined as members of the
Silicon implementation platform family. With SOC integration,
implementation platforms are becoming more diverse and heterogeneous,
combining various implementation strategies with diverging flexibility,
granularity, performance, and energy-efficiency properties.
For the case of software, the “platform” has been designed as a fixed
micro-architecture to minimize mask making costs but flexible enough to
warrant its use for a set of applications so that production volume will be
high over an extended chip lifetime. Micro-controllers designed for
automotive applications such as the Motorola Black Oak PowerPC are
examples of this approach. DSPs for wireless such as the TI C50 are another
one. The problem with this approach is the potential lack of optimization that
may make performance too low and size too large. A better approach is to
develop “a family” of similar chips that differ for one or more components
but that are based on the same microprocessor. The various versions of the
TI C50 family (such as the 54 and 55) are examples of such. Indeed this
family and its “common” programmatic interface is, in our definition, a
platform; more specifically an architecture platform.
The platform-concept has been particularly successful in the PC world,
where PC makers have been able to develop their products quickly and
efficiently around a standard “platform” that emerged over the years. The PC
standard platform consists of: the x86 instruction set architecture (ISA) that
makes it possible to re-use the operating system and the software application
at the binary level; a fully specified set of busses (ISA, USB, PCI); legacy
support for the ISA interrupt controller that handles the basic interaction
between software and hardware; and a full specification of a set of I/O
devices, such as keyboard, mouse, audio and video devices. All PCs should
satisfy this set of constraints. If we examine carefully the structure of a PC
platform, we note that it is not the detailed hardware micro-architecture that
is standardized, but rather an abstraction characterized by a set of constraints
on the architecture. The platform is an abstraction of a “family” of (micro)-
architectures.
We believe that the platform paradigm will be an important component
of a future electronic system design methodology. A number of companies
have already embraced the platform-concept in the design of integrated
embedded systems. An excellent example of such is the Nexperia platform,
developed by Philips Semiconductor [3]. Nexperia serves as the standard
implementation strategy for a wide range of video products within Philips.
The platform combines a set of processors (MIPS + TriMedia) with a family
of peripheral devices, accelerators, and I/O units. Essential is also a set of
standardized busses. Depending upon the needs of a particular product
(family), the IC designer can choose to drop/add particular components. The
system designer’s interface however remains unchanged, which allows for
maximum reusability and portability. Since all components have been
extensively tested and verified, design risk is reduced substantially.
Within this framework, it is worthwhile questioning how power/energy
comes into the equation. Indeed, the choice of the right platform architecture
can have an enormous impact on the ultimate efficiency of the product.

15.3 OPPORTUNITIES FOR ENERGY MINIMIZATION

While most of the literature of the last decade has focused on power
dissipation, it is really minimization of the energy dissipation in the presence
of performance constraints that we are interested in. For real-time fixed-rate
applications such as DSP, energy and power metrics are freely
interchangeable as the rate is a fixed design constraint. In multi-task
computation, on the other hand, both energy and energy-delay metrics are
meaningful, depending upon the prime constraints of the intended design. In
the remainder of the text, we will focus mainly on the energy metric,
although energy-delay minimization is often considered as well. The
parameters the architecture designer can manipulate to reduce the energy
budget include supply and signal voltages, and leakage.

15.3.1 Voltage as a Design Variable

While traditionally the voltages were fixed over a complete design, it is fair to state that, more and more, voltage can be considered as a parameter that can vary depending upon the location on the die and dynamically over time.
Many researchers have explored this in recent years and the potential
benefits of varying supply voltages are too large to ignore. This is especially
the case in light of the increasing importance of leakage currents.
Matching the desired supply voltage to a task can be accomplished in
different ways. For a hardware module with a fixed functionality and
performance requirement, the preferred voltage can be set statically (e.g. by
choosing from a number of discrete voltages available on the die).
Computational resources that are subject to varying computational
requirements have to be enclosed in a dynamic voltage loop that regulates
the voltage (and the clock) based on the dialed performance level [4]. This
concept is called adaptive voltage scaling.
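The voltage-regulation loop described above can be sketched as a first-order model: pick the lowest supply whose gate delay still meets the dialed performance level, and reap the quadratic energy savings. The alpha-power delay law is a textbook approximation, and the constants (VT, ALPHA, VDD_MAX) are assumed values for illustration, not taken from the chapter.

```python
# Illustrative sketch of adaptive voltage scaling: choose the lowest
# supply voltage whose gate delay still meets a dialed performance
# level. Constants are hypothetical.

VT = 0.35        # threshold voltage (V), assumed
ALPHA = 1.5      # velocity-saturation exponent, assumed
VDD_MAX = 1.8    # maximum supply available on the die (V), assumed

def relative_delay(vdd):
    """Alpha-power-law gate delay at vdd, normalized to the delay at VDD_MAX."""
    d = vdd / (vdd - VT) ** ALPHA
    d_max = VDD_MAX / (VDD_MAX - VT) ** ALPHA
    return d / d_max

def relative_energy(vdd):
    """Switching energy scales as C * Vdd^2; normalized to VDD_MAX."""
    return (vdd / VDD_MAX) ** 2

def scale_voltage(speed_fraction):
    """Lowest supply (on a 1 mV grid) that still sustains the requested
    fraction of full speed; the clock is slowed along with it."""
    for mv in range(360, 1801):
        vdd = mv / 1000
        if relative_delay(vdd) <= 1.0 / speed_fraction:
            return vdd
    return VDD_MAX

# Running at half speed permits a much lower supply, and the energy per
# operation drops quadratically with it.
v_half = scale_voltage(0.5)
```

Under these assumed constants, halving the dialed performance allows roughly a halved supply and cuts the energy per operation to well under a third, which is the essence of matching voltage to the task.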

15.3.2 Eliminating Architectural Waste

Reducing switching capacitance typically comes down to a single, relatively obvious task: avoid waste. A perfunctory investigation of current integrated
systems demonstrates that this is not that obvious or trivial. Energy is being
wasted at all levels of the design hierarchy, typically as a result of the chosen
implementation platform and implementation strategy:
Circuit level — correct sizing of the devices in concert with the
selection of the supply and the threshold voltages is crucial.
Architecture level — load-store architectures bring with them a huge
overhead in terms of instruction fetching, decoding, data
communication, and memory accesses.
Application level — a large gap exists between the application
designers and the underlying implementation platforms. As a result,
applications specifications and algorithms are often selected without
any clear insight in the impact on performance and energy.

In this chapter, we will mainly concentrate on the potential optimizations at the architectural level. Only a small fraction of the energy is typically
spent on the real purpose of the design, i.e. computation. The rest is wasted
in overhead functions such as clock distribution, instruction fetching and
decoding, busing, caching, etc. Energy-efficient design should strive to make
this overhead as small as possible, which can be accomplished by sticking to
a number of basic guidelines (the low-energy roadmap):
Match architecture and computation to a maximum extent
Preserve locality and regularity inherent in the algorithm
Exploit signal statistics and data correlations
Energy (and performance) should only be delivered on demand, i.e.
an unused hardware module should consume no energy whatsoever.
This is most easily achievable in ASIC implementations, and it hence
comes as no surprise that dedicated custom implementations yield the best
solutions in terms of the traditional cost functions such as power, delay, and
area (PDA). Indeed, it is hard to beat a solution that is optimized to perform
solely a single well-defined task. However, rapid advances in portable
computing and communication devices require implementations that must
not only be highly energy-efficient, but they must also be flexible enough to
support a variety of multimedia services and communication capabilities.
With the dramatic increase in design complexity and mask cost, reuse of
components has become an essential requirement. The
required flexibility dictates the use of programmable processors in
implementing the increasingly sophisticated digital signal processing
algorithms that are widely used in portable multimedia terminals. However,
compared to custom, application-specific solutions, programmable
processors often incur stiff penalties in energy efficiency and performance.
It is our point of contention that adhering strictly to the low-energy
roadmap can lead to programmable architectures that consume dramatically
less power than the traditional programmable engines. Reconfigurable
architectures that program by restructuring the interconnections between
modules are especially attractive in that respect, especially because they
allow for obtaining an adequate match between computational and
architectural granularity.

15.4 PROGRAMMABLE ARCHITECTURES — AN OVERVIEW

For a long time, programmable architectures have been narrowly defined to be load-store style processors, either in stand-alone format or in
clusters of parallel operating units (SIMD, MIMD). The latter have
traditionally been of the homogeneous type, i.e. all processing units are of
the same type and operate on the same type of data. In recent years, it has
been observed at numerous sites that this model is too confining and that
other programmable or configurable architectures should be considered as
well. This was inspired by the success of programmable logic (FPGA) to
implement a number of computationally intensive tasks at performance
levels or costs that were substantially better than what could be achieved
with traditional processors [5]. While intrinsically not very efficient, FPGAs
have the advantage that a computational problem can be directly mapped to
the underlying gate structure, hence avoiding the inherent overhead of fixed-
word length, fixed-instruction-set processors. Configurable logic represents
an alternative architecture model, where programming is performed at a
lower level of granularity.

15.4.1 Architecture Models

Trading off between those architectures requires an in-depth understanding of the basic parameters and constraints of the architecture, their relationship
to the application space, and the PDA (power-delay-area) cost functions.
While most studies with this respect have been either qualitative or
empirical, a quantitative approach in the style advocated by Hennessy and
Patterson [7] for traditional processor architectures is desirable. Only limited
results in that respect have been reported. The most in-depth analysis on the
efficiency and application space of FPGAs for computational tasks was
reported by Andre Dehon [6], who derived an analytical model for area and
performance as a function of architecture parameters (such as data-path
width w, number of instructions stored per processing element c, number of
data words stored per processing element d), and application parameters
(such as word length and path length — the number of sequential
instructions required per task). Figure 15.2 plots one of the resulting
measures of the model, the efficiency — the ratio of the area of running an
application with word length on an architecture with word length
versus running it on architecture with word length w. As can be observed,
the processor excels at larger word lengths and path lengths, while the FPGA
is better suited for tasks with smaller word and path lengths.
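A toy area model in the spirit of this analysis makes the trend concrete: narrow PEs must be ganged up for wide applications, while shallow PEs must be replicated for long path lengths. All coefficients and the PE parameter choices below are invented for illustration; they are not the parameters of the cited model.

```python
import math

# Toy per-PE area model: a fixed configuration overhead plus costs that
# grow with datapath width w and the number of stored contexts c.
# Coefficients are invented for illustration only.
A_FIXED = 100    # per-PE overhead (configuration, control)
A_BIT = 10       # area per datapath bit
A_CTX = 2        # area per context, per bit

def pe_area(w, c):
    return A_FIXED + A_BIT * w + A_CTX * c * w

def app_area(w_app, path_len, w_arch, c_arch):
    """Area to run an application of word length w_app and path length
    path_len (sequential instructions per task) on a given architecture:
    narrow PEs are ganged for width, shallow PEs replicated for depth."""
    n_wide = math.ceil(w_app / w_arch)
    n_deep = math.ceil(path_len / c_arch)
    return n_wide * n_deep * pe_area(w_arch, c_arch)

def efficiency(w_app, path_len, w_arch, c_arch):
    """Area of a perfectly matched architecture divided by the area of
    the architecture actually used (1.0 means no waste)."""
    matched = app_area(w_app, path_len, w_app, path_len)
    return matched / app_area(w_app, path_len, w_arch, c_arch)

# FPGA-like PE: 1-bit wide, single context.
# Processor-like PE: 32-bit wide, deep instruction store.
fpga, proc = (1, 1), (32, 64)
```

Even with these made-up coefficients, the model reproduces the crossover noted in the text: the processor-like PE wins for wide words and long paths, the FPGA-like PE for small word and path lengths.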
Limiting the configurable architecture space to just those two
architectures has proven to be too restrictive and misses major opportunities
to produce dramatic reductions in the PDA space. Potential expansions can
go in a number of directions:
By changing the architecture word length w — sharing the
programming overhead over a number of bits. This increases the
PDA efficiency if the application word length matches the
architecture word length.
By changing the data storage d — this introduces the potential for
local buffering and storing data.
By changing the number of resources r — this makes it possible to
implement multiple operations on the PE by providing concurrent
units (programming in the space domain).
By changing the number of contexts c — this makes it possible to
implement multiple operations on the PE by time-multiplexing
(programming in the time domain).
By reducing the flexibility f, i.e. the type of operations that can be
performed on the processing element, i.e. making it more dedicated
towards a certain task.

Definitions:
The flexibility index of a processing element (PE) is defined as the
ratio of the number of logical operations that can be performed on
the PE versus the total set of possible logical operations. PEs that
can perform all logical operations, such as general-purpose
processors and FPGAs, have a flexibility index equal to 1 (under the
condition that the instruction memory is large enough). Dedicated
units such as adders or multipliers have a flexibility close to 0, but
tend to score considerably better in the PDA space.
The granularity index of a processor is defined as a function
g(w,d,r,c), which is a linear combination of w, d, r, and c
parameters, weighted proportionally to their cost.
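The two definitions can be transcribed directly. The cost weights in the granularity index below are placeholders: the text only states that the combination is linear, with weights proportional to cost.

```python
# Direct transcription of the flexibility and granularity indices
# defined above; weight values are illustrative placeholders.

def flexibility_index(supported_ops, possible_ops):
    """Ratio of logical operations a PE can perform to the total set of
    possible logical operations: 1.0 for general-purpose processors and
    FPGAs, close to 0.0 for dedicated units such as adders."""
    return supported_ops / possible_ops

def granularity_index(w, d, r, c, weights=(1.0, 0.5, 2.0, 0.25)):
    """g(w, d, r, c): linear combination of datapath width w, data
    storage d, resources r, and contexts c, weighted by cost."""
    kw, kd, kr, kc = weights
    return kw * w + kd * d + kr * r + kc * c

# A 2-input LUT can realize all 16 two-input Boolean functions:
lut_flex = flexibility_index(16, 16)
# A dedicated 2-input gate realizes just one of them:
gate_flex = flexibility_index(1, 16)
```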

A number of authors have considered various combinations of the above parameters. Most of these studies ignore the impact of changing the
flexibility index, which can have an enormous impact on the PDA cost
function. This is illustrated in Figure 15.3, which plots the energy-efficiency
versus flexibility for different architectural choices over a number of
benchmarks. More than three orders of magnitude in efficiency can be
observed between an ASIC style solution and a fully programmable
implementation on an embedded microprocessor. Observe that these
numbers have been normalized for voltage and technology. The differences
are mostly due to the overhead that comes with flexibility. Application-
specific processors and configurable solutions improve the energy-efficiency
at the expense of flexibility. Also, better matching in granularity of
application and architecture plays a role. This dramatic reduction in energy
argues that reducing the flexibility is an option that should not be ignored
when delineating the architectural space for the future systems-on-a-chip.
This brings us to the next level of architectural modeling, the composition.
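The normalization for voltage and technology mentioned for Figure 15.3 can be sketched with first-order scaling rules: energy per operation scales quadratically with supply and, to first order, linearly with feature size through the switched capacitance. The reference point and the sample numbers below are invented for illustration.

```python
# One plausible way to normalize energy-per-operation numbers for
# supply voltage and technology. Quadratic voltage scaling and linear
# capacitance-vs-feature-size scaling are textbook first-order
# assumptions; the reference point is arbitrary.

V_REF = 1.0      # reference supply (V), assumed
L_REF = 0.18     # reference feature size (um), assumed

def normalized_energy(e_op, vdd, feature):
    """Scale a measured energy/operation to the reference process point:
    E ~ C * Vdd^2, with switched capacitance C roughly proportional to
    the feature size."""
    return e_op * (V_REF / vdd) ** 2 * (L_REF / feature)

# Two hypothetical designs measured at different process points:
asic = normalized_energy(1.0, 1.0, 0.18)      # already at the reference
proc = normalized_energy(4000.0, 2.5, 0.6)    # older, higher-voltage part
```

After normalization the raw four-orders-of-magnitude gap shrinks, but a large residual difference remains, which is exactly the flexibility overhead the figure isolates.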
15.4.2 Homogeneous and Heterogeneous Architectures

An overall chip architecture can be considered as a composition of computing elements with varying degrees of granularity and flexibility. This
introduces another set of parameters into the model: homogeneity and
connectivity. An architecture is called homogeneous if the composing PEs
are identical. This has been the architecture of choice for the majority of the
multi-PE architectures so far. Examples are multi-processors and FPGAs.
Maintaining homogeneity tends to improve processor utilization and
simplifies the mapping problem. On the other hand, embedded systems seem
to embrace heterogeneity. This is mainly due to the diversity in the
computational requirements of a typical system. The multimedia terminal of
[1], for instance, combines a wide variety of functions, each with different
degrees of granularity, adaptivity, and type of operations.
Based on the above analysis, it is possible to classify systems-on-a-chip
into three categories:
Homogeneous arrays of general-purpose processing elements.
Architectures are differentiated by the granularity of the processing
elements. The only departure from the overall homogeneity is that
these parts typically will include large chunks of embedded memory.
It is for instance projected that FPGAs in the year 2010 can pack
between 2 and 5 million "real" gates and will contain more than 1
Mbyte of memory. Circuits of this class are typically used for
general-purpose computations and prototyping with limited
constraints in the PDA domain.
Application-specific combination of processing elements.
Implementations of these types are in general geared towards a
single application. They act as board replacements, and combine
flexible components with application-specific accelerators. The
implementation of these dedicated systems only makes economical
sense for large volumes.
Heterogeneous combinations of processing elements of different
granularity and flexibility. These represent the real novelty in the
system-on-a-chip era, and can be called under the denominator of
agile computing systems [8]. The heterogeneity by nature restricts
the applicability of the circuit to a limited domain (domain-specific
processors), but at the same time yields solutions that score well in
the PDA space. The most important question to be answered by the
would-be designer is the choice of the programming elements and
their connectivity.
The remainder of this chapter will be devoted to the latter category. The
possible trade-offs will be discussed based on a comparison between some
emerging approaches. One architectural template, proposed in the Berkeley
Pleiades project, is discussed in more detail.

15.4.3 Agile Computing Systems (Heterogeneous Compute Systems-on-a-chip)

As mentioned earlier, a range of system-on-chip implementations have already been realized by industry. Most of them combine one or more
microprocessor sub-systems (e.g. ARM 8), DSP processors (TI5x), and
dedicated accelerator units (MPEG or audio decoders), connected through a
standard processor bus. Most SOCs, designed today for wireless
applications, fall in this space. While having the advantage of being
composed of well-understood building blocks with established software
support, the overall combination does not yield dramatic improvements in
the PDA space, is restricted in its application domain, and is hard to program
as no overlaying computational model is defined.
Another option is the array of heterogeneous processors architecture.
This approach has the advantage that the overall model is well understood,
and that system software may be readily available. Applications can easily be
identified in the graphics, networking, and multimedia areas. A number of
companies have advocated arrays of heterogeneous VLIWs. This approach
has the advantage of providing higher performance at low-clock speeds (and
voltages) due to the extensive use of parallelism. The fully “programmable”
approach comes at a serious penalty in energy efficiency.
The combination of microprocessor and FPGA has recently achieved a
lot of attention as an attractive alternative. One premier example is the
Virtex-II Pro architecture from Xilinx [9], which combines a large FPGA
array with an embedded PowerPC microprocessor and embedded
multipliers. The FPGA can serve for a variety of functions, such as
extending the instruction set of the core processor, implementing a high-
performance dedicated compute engine, or as peripheral unit. Software
support is once again the main hurdle for this system approach to overcome.
To be successful, fast, predictable, and verifiable compilation is a necessity.
15.5 THE BERKELEY PLEIADES PLATFORM [10]

15.5.1 Concept

The heterogeneous architectures presented above cover two, or at most three spots in the granularity/flexibility space. For instance, the microprocessor-
FPGA combination allows for a trade-off between either very small or very
large granularity (each of which is completely flexible). Other granularity
levels can be considered besides the two extremes of the stored-program and
the gate (or transistor)-level reconfigurable modules. Reconfiguring at the
data-path-operator level has the advantage that the programming overhead
can be shared over a number of bits, hence resulting in a denser structure and
less programming overhead. Another option is to reconfigure at the
arithmetic (or functional) module level. Each of the modules at this level
represents a large-granularity, weakly-programmable function (such as
memory, multiply-accumulate, or address-generator). By configuring the
interconnect between those modules, we obtain an application-specific
arithmetic processor that is optimized for a given computational task and that
incurs only minimal programming overhead.
While it is theoretically conceivable to map virtually any computational
function onto any of the reconfiguration structures, doing so inevitably leads
to either area, power, or performance penalties. It is essential to match
computational and architectural granularity. Performing a multiplication on a
PGA is bound to carry a huge amount of waste. So does executing a large
dot-vector product on a microprocessor. The choice of the correct
programming model can help to enable a wide range of power-reduction
techniques, such as running at the minimum supply voltage and frequency,
exploitation of locality of reference, and temporal and spatial signal
correlations. Unfortunately, no information is available regarding the relative
performance and energy trade-off curves of the different models, which
makes it extremely hard to determine where and when a given model is the
most appropriate. Furthermore, the information that is available is skewed
and might give an incomplete or incorrect perspective. For instance, SRAM
programmable FPGAs are rightly known to be rather slow and power-
hungry. The main reason for this is that these parts have been designed for
flexibility and ease-of-programming first with performance as a high priority
and energy not an issue at all. Observe that the same considerations were
valid for microprocessors as well just a couple of years ago. It is our belief
that high-performance low-energy programmable gate-arrays are possible
when giving up somewhat on the flexibility.
15.5.2 Architecture

The Pleiades architecture, developed at UC Berkeley [10], attempts to integrate a wider variety of reconfigurable components into a single
structure. The architecture, shown in Figure 15.4, presents a platform that
can be used to implement a domain-specific processor instance, which can
then be programmed to implement a variety of algorithms within a given
domain of interest. All instances of the architecture template share a
common set of control and communication primitives. The type and the
number of processing elements may vary; they depend upon the properties
and the computational requirements of the particular domain of interest. This
concept is totally consistent with the platform-based design concept,
described earlier. This style of reconfigurable architectures offers an
architectural solution that allows trading off flexibility for increased
efficiency. This assertion is based on the observation that for a given domain
of algorithms (be it in signal processing or in control), the underlying
computational kernels that account for a large fraction of execution time and
energy are very similar. By executing the dominant kernels of a given
domain of algorithms on dedicated, optimized processing elements that can
execute those kernels with a minimum of energy overhead, significant
energy savings can potentially be achieved.

The architecture is centered around a reconfigurable communication network. Communication and computation activities are coordinated via a
distributed data-driven control mechanism. Connected to the network are an
array of heterogeneous, autonomous processing elements, called satellite
processors. These could fall into any of the reconfigurable classes: a general
microprocessor core (most of the time only one of these is sufficient), a
dedicated functional module such as a multiply-accumulator or a DCT unit,
an embedded memory, a reconfigurable data path, or an embedded PGA.
Observe that each of the satellite processors has its own autonomous
controller, although the instruction set of most of these modules is very
shallow (i.e. weakly programmable). The dedicated nature of the satellite
processors makes it possible to execute multimedia operations with minimal
overhead, hence achieving low energy per operation. The controller
overhead is minimal as the instruction set of a given satellite processor is
typically small, and very often is nothing more than a single control register.
High performance can be achieved at low voltage level through the use of
concurrency, both within a processor, or by dividing a task over multiple
processors. Finally, most of the data transfers and memory references within
a processor access only local resources and are hence energy-efficient. An
example of a co-processor for multiply-accumulate operations is shown in
Figure 15.5.

The microprocessor core plays a special role. Besides performing a number of miscellaneous non-compute intensive or control-dominated tasks,
it configures the satellite processors and the communication network over
the reconfiguration bus. It also manages the overall control flow of the
application, either in a static compiled order, or through a dynamic real-time
kernel.
The application is partitioned over the various computational resources,
based on granularity and recurrence of the computational sub-problem. For
instance, a convolution is mapped onto a combination of address generator,
memory, and multiply-accumulate processors (Figure 15.6). The connection
between these modules is set up by the control processor and remains static
during the course of the computation. The same modules can, in another phase of the application, be used in a different configuration to compute, for
instance, an FFT.
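The convolution mapping of Figure 15.6 can be sketched behaviorally: an address-generator satellite drives two memory satellites, which feed a multiply-accumulate satellite. The class structure below is invented to mirror the description in the text, not taken from the Pleiades implementation.

```python
# Behavioral sketch of a convolution mapped onto address-generator,
# memory, and multiply-accumulate satellites; all names are illustrative.

class AddressGenerator:
    def __init__(self, n_taps):
        self.n_taps = n_taps

    def addresses(self, out_index):
        # For output y[n]: data index n - k paired with coefficient index k.
        return [(out_index - k, k) for k in range(self.n_taps)]

class Memory:
    def __init__(self, contents):
        self.contents = contents

    def read(self, addr):
        # Out-of-range reads return zero (models zero-padded input).
        return self.contents[addr] if 0 <= addr < len(self.contents) else 0

class MultiplyAccumulate:
    def fire(self, pairs):
        acc = 0
        for a, b in pairs:
            acc += a * b
        return acc

def convolve(x, h):
    """Compose the satellites in the static configuration of the figure."""
    agen = AddressGenerator(len(h))
    xmem, hmem = Memory(x), Memory(h)
    mac = MultiplyAccumulate()
    out = []
    for n in range(len(x) + len(h) - 1):
        pairs = [(xmem.read(i), hmem.read(k)) for i, k in agen.addresses(n)]
        out.append(mac.fire(pairs))
    return out
```

The connection between the modules is fixed for the duration of the task, matching the statically configured links described above; only the address streams change from cycle to cycle.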

15.5.3 Communication Network

As mentioned, the control processor configures the communication network on a per-task basis. Each arc in the data-flow graph is statically
assigned a dedicated link in the communication network. This ensures that
all temporal correlations in a given stream of data are preserved, and the
amount of switching activity minimized. The network itself is implemented as
a segmented bus-structure, similar to those used in FPGA architectures [11].
This approach has the advantage of being more area-efficient than a full
cross-bar, while still providing a high level of connectivity between the
processors and while minimizing the capacitance on a given bus segment.
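The capacitance advantage of the segmented bus can be illustrated with a first-order model: a transfer only charges the segments joined between source and destination, whereas a monolithic shared bus always switches its full length. The segment capacitance value is hypothetical.

```python
# First-order switched-capacitance model of a segmented bus versus a
# single shared bus; the per-segment capacitance is a placeholder.

C_SEGMENT = 1.0   # capacitance of one bus segment, arbitrary units

def shared_bus_cap(n_segments):
    """A monolithic bus switches its full length on every transfer."""
    return n_segments * C_SEGMENT

def segmented_bus_cap(src, dst):
    """Only the segments spanning src..dst are joined by the configured
    switches, so only those get charged."""
    return (abs(dst - src) + 1) * C_SEGMENT
```

Nearest-neighbor transfers, which preserved locality makes common, then switch only a small fraction of the total bus capacitance; only a worst-case end-to-end transfer pays the full shared-bus cost.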
466 Reconfigurable Processors

A major concern in the use of an MIMD architecture (which Pleiades
really is) is the synchronization of the multiple co-processors. One possible
choice would be to opt for a static approach, where all synchronization is
determined at compile time and governed by a single central controller. This
approach has the disadvantage of lacking flexibility and scalability, while its
centralized nature is counter to the low-power roadmap. An elegant and
energy-efficient solution is offered by a study of the applications at hand.
The kernels that are siphoned off to the satellites are typically computation-
oriented and are well represented by a data-flow computational model. A
processor-synchronization protocol inspired by data-flow has the advantage
of matching well with the applications at hand, and reducing the
synchronization overhead. The data-driven scheme proposed by Yeung [12]
coordinates computation and communication by providing tokens along with
the data. The presence of a token on the corresponding signaling line
indicates the availability of data and activates the destination processor.
When implemented for a homogeneous fine-grained processor array, it was
demonstrated that this approach reduces power consumption in the global
interconnect network to at most 10% of the overall power consumption
over a range of applications. Execution of a co-processor function is
triggered by the arrival of tokens. When no tokens are to be processed at a
given module, it can go into a dormant mode and no switching activity
occurs. This scheme hence implements a demand-driven policy for
managing the switching activity in all hardware blocks. It has the further
benefits of being modular, scalable, and distributed.
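The demand-driven token scheme described above can be modeled abstractly as follows; the class names and the all-inputs firing rule are illustrative assumptions, a sketch rather than Yeung's actual protocol:

```python
# A satellite processor fires only when tokens are present on all of its
# input links; otherwise it stays dormant and causes no switching activity.
from collections import deque

class Link:
    def __init__(self):
        self.fifo = deque()       # data values travel with their tokens
    def put(self, value):
        self.fifo.append(value)   # token and data arrive together
    def has_token(self):
        return bool(self.fifo)
    def get(self):
        return self.fifo.popleft()

class Satellite:
    def __init__(self, name, inputs, output, op):
        self.name, self.inputs, self.output, self.op = name, inputs, output, op
    def step(self):
        # Demand-driven policy: fire only if every input carries a token.
        if all(link.has_token() for link in self.inputs):
            self.output.put(self.op(*[l.get() for l in self.inputs]))
            return True           # active this cycle
        return False              # dormant: local activity stays off

a, b, out = Link(), Link(), Link()
mac = Satellite("MAC", [a, b], out, lambda x, y: x * y)
a.put(3)
print(mac.step())   # False: only one operand token present -> dormant
b.put(4)
print(mac.step())   # True: both tokens present -> fires
print(out.get())    # 12
```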
The Pleiades approach has the advantage that it can exploit the right
levels of granularity and flexibility as needed by the application domain, yet
that it can be supported by a well-defined design and implementation
methodology. The mixture of a control-dominated computation model at the
application level (implemented in the OS of the control processor) and a
data-flow model at the kernel level (implemented by the co-processors)
makes it possible to devise a mapping and compilation strategy that is easily
supported by today’s software environments. In fact, the state-of-the-art
design environments for embedded DSP applications propose a very similar
split between control-dominated and data-dominated tasks.

15.5.4 Benchmark Example: The Maia Chip [10]

Maia is a Pleiades processor for speech coding applications. Figure 15.7
shows the block diagram of the Maia processor. The computational core of
Maia consists of the following satellite processors: 8 address generators, 4
512-word 16-bit SRAMs, 4 1024-word 16-bit SRAMs, 2 Multiply-
Accumulate Units, 2 Arithmetic/Logic Units, a low-energy embedded FPGA
unit, 2 input ports, and 2 output ports. To support speech coding efficiently,
16-bit datapaths were used in the satellite processors and the communication
network. The communication network uses a 2-level hierarchical mesh
structure. To reduce communication energy, low-swing driver and receiver
circuits are used in the communication network. Satellite processors
communicate through the communication network using the 2-phase
asynchronous handshaking protocol. Each link through the communication
network consists of a 16-bit data field, a 2-bit End-of-Vector (EOV) field,
and a request/acknowledge pair of signals for data-driven control and
asynchronous handshaking. The EOV field can have one of three values: 0,
1, 2. As a result, the control mechanism used in Maia can support scalar,
vector, and matrix data types. The I-Port and O-Port satellites are used for
off-chip data I/O functions.
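One way such a 2-bit EOV tag can delimit scalar, vector, and matrix transfers on a word-serial link is sketched below. The tag semantics used here (0 = more data, 1 = end of row/vector, 2 = end of transfer) are an illustrative assumption, not the documented Maia encoding:

```python
# Reassemble a tagged word stream into rows of a transfer.
def decode(stream):
    """stream: iterable of (data, eov) pairs; returns a list of rows."""
    rows, row = [], []
    for data, eov in stream:
        row.append(data)
        if eov in (1, 2):        # row boundary
            rows.append(row)
            row = []
        if eov == 2:             # transfer boundary
            break
    return rows

print(decode([(7, 2)]))                          # scalar     -> [[7]]
print(decode([(1, 0), (2, 0), (3, 2)]))          # vector     -> [[1, 2, 3]]
print(decode([(1, 0), (2, 1), (3, 0), (4, 2)]))  # 2x2 matrix -> [[1, 2], [3, 4]]
```

A single tag field thus lets the same hardware handshake carry all three data types, as the text states.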

The Maia processor was fabricated in a CMOS technology. The
chip contains 1.2 million transistors and measures It was
packaged in a 210-pin PGA package. Die photo of Maia is shown in Figure
15.8. With a 1.0-V supply voltage, average throughput for kernels running
on the satellite processors is 40 MHz. The ARM8 core runs at 40 MHz. The
average power dissipation of the chip is 1.5 to 2.0 mW.
Table 15.1 shows the energy profile of the VSELP speech coding
algorithm, running on Maia. Six kernels were mapped onto the satellite
processors. The rest of the algorithm is executed on the ARM8 control
processor. The control processor is also responsible for configuring the
satellite processors and the communication network. The energy overhead of
this configuration code running on the control processor is included in the
energy consumption values of the kernels. In other words, the energy values
listed in Table 15.1 for the kernels include contributions from the satellite
processors as well as the control processor executing configuration code.
The power dissipation of Maia when running VSELP is 1.8 mW. The lowest
power dissipation reported in the literature to date is 17 mW for a
programmable signal processor executing the Texas Instruments
TMS320LC54x instruction set, implemented in a CMOS process,
running at 63 MHz with a 1.0-V supply voltage [12]. The energy efficiency
of this reference processor is whereas the energy efficiency
of Maia is which corresponds to an improvement by a factor of
six.

15.6 ARCHITECTURAL INNOVATIONS ENABLE CIRCUIT-LEVEL OPTIMIZATIONS

While Pleiades attempts to address the energy problem at the architectural
level, the proposed structure can have some interesting ramifications on the
circuit level as well, some of which are presented in the paragraphs below.

15.6.1 Dynamic Voltage Scaling

In the low-energy roadmap, it was outlined that adjusting the supply voltage
to the computational requirements can lead to dramatic energy savings. The
distributed nature of the Pleiades architecture makes it naturally suited to
exploit some of the opportunities offered by dynamic voltage scaling. Most
of the co-processors perform a single task with a well-defined workload. For
these, a single operational voltage, carefully chosen to meet the
computational requirements, is sufficient. The typical operating voltage of
the Pleiades satellite processors is set to 1.5 V.
The control processor, on the other hand, can experience workloads
varying over a wide range depending upon the task being performed:
operating system, reconfiguration, compute function, or background.
Varying the supply voltage to accommodate these changes in computational
requirements may greatly reduce the energy consumed by the core (which is
after all still a sizable fraction of the total). Observe that a change in the
supply voltage also varies the clock frequency of the processor.
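This coupling between supply voltage, clock frequency, and energy per operation can be sketched with the first-order alpha-power delay model. The threshold voltage and exponent below are assumed illustrative values, not Pleiades parameters:

```python
# E/op scales as C*V^2; achievable frequency roughly as (V - Vt)^alpha / V.
VT = 0.5      # threshold voltage (V), assumed
ALPHA = 1.5   # velocity-saturation exponent, assumed

def rel_frequency(vdd, vref=1.5):
    """Clock frequency relative to operation at vref."""
    f = lambda v: (v - VT) ** ALPHA / v
    return f(vdd) / f(vref)

def rel_energy_per_op(vdd, vref=1.5):
    """Switching energy per operation relative to vref, E ~ C * V^2."""
    return (vdd / vref) ** 2

for v in (1.5, 1.2, 1.0):
    print(f"Vdd={v:.1f} V: f={rel_frequency(v):.2f}x, E/op={rel_energy_per_op(v):.2f}x")
```

Under these assumptions, halving the energy per operation costs well under a 2x slowdown, which is why matching the voltage to the workload pays off whenever the task does not need peak speed.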

15.6.2 Reconfigurable Low-swing Interconnect Network

The proposed heterogeneous architecture is centered around communications
between a core processor and a set of co-processors. This network still
consumes a sizable amount of energy, even though the static data-driven
nature of the interconnect helps to reduce the consumption (see above). The
design is further complicated by the fact that each (co-)processor may have
its own operating frequency and supply voltage, some of which might be
dynamically varying. A solution is offered by extending the data-driven
protocol all the way down to the circuit level. The challenge at the circuit
layer is to provide a signaling scheme that does not depend upon the
operating frequency of any of the connected modules. A simple two-phase
signaling scheme is presented in Figure 15.9. As can be observed, individual
processors operate on a locally generated clock and hence follow the
synchronous design approach. The operation of the clock is enabled by the
presence of a token event on the data receiver, i.e. when no data is available
the local clock is automatically turned off. The period of the clock generator
is programmable and can be adapted to the required performance or
operation voltage. The combination of the chosen two-phase signaling and
the local clock generation ensures that synchronization failures cannot occur!
The resulting globally asynchronous - locally synchronous protocol might
not only be attractive in the Pleiades model, but could be useful for other
system-on-a-chip distributed architectures as well. From an energy
perspective, its major attraction is the elimination of a single global clock. A
potential disadvantage is a slight drop in performance since the local clock
generators have to provide a built-in safety margin to ensure that no timing
hazards can occur as a result of process variations.
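A toy model of such a two-phase (transition-signaling) handshake follows: a req transition marks new data, an ack transition consumes it, and the receiver's local clock ticks only while a token is pending. The class structure and polling loop are illustrative assumptions rather than the circuit of Figure 15.9:

```python
class TwoPhaseLink:
    def __init__(self):
        self.req = 0          # transition-encoded request
        self.ack = 0          # transition-encoded acknowledge
        self.data = None

    def send(self, value):
        assert self.req == self.ack, "previous transfer not yet acknowledged"
        self.data = value
        self.req ^= 1         # a transition, not a level, signals new data

    def token_pending(self):
        return self.req != self.ack

    def receive(self):
        assert self.token_pending()
        self.ack ^= 1         # acknowledge by toggling
        return self.data

link = TwoPhaseLink()
clock_ticks = 0
link.send("sample0")
for _ in range(4):            # receiver polls on its own local clock
    if link.token_pending():  # clock enabled only by a token event
        clock_ticks += 1
        print(link.receive())
print("receiver clock ticks:", clock_ticks)
```

Because correctness depends only on req/ack transitions, neither side needs to know the other's clock frequency, which is the essence of the globally asynchronous, locally synchronous scheme.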
To reduce the energy even further, a reduced-swing signaling approach
can be adopted for the communication network with standardized
receivers/transmitters at the processor interfaces acting as level converters.
For more information about this network, we refer the interested reader to
[13].
15.7 SUMMARY

Agile computing architectures, consisting of a heterogeneous collection of
computational engines ranging from microprocessors to embedded
programmable logic, play a dominant role in the system-on-a-chip era. They
combine the advantages of programmability, hence leveraging off the design
cost of a complex part over a number of designs and providing at the same
time adaptivity and flexibility, with the energy efficiency of more dedicated
architectures. Selecting the correct model of computation for a given
application (or application kernel) is the single most important prescription
of the low-energy roadmap, presented in this chapter.
The platform-based design methodology provides a means of reusing the
extensive effort that goes into the development of one of these architectures
and the accompanying software support tools over a wide range of
applications, typically located within a single application domain (such as
wireless, automotive, or multimedia). The Pleiades project demonstrated
how the choice of the right platform architecture with a judicious choice of
the computational modules and the interconnect network can lead to very
low-power implementations of computationally intensive functions, while
maintaining the necessary flexibility to cover a range of applications.

REFERENCES
[1] J. Borel, “Technologies for multimedia systems on a chip,” Proc. IEEE ISSCC
Conference 1997, pp. 18-21, San Francisco, February 1997.
[2] J. Rabaey and A. Sangiovanni-Vincentelli, “System-on-a-chip – a platform perspective,”
Keynote presentation, Proceedings Korean Semiconductor Conference, February 2002.
[3] T. Claessen, “First time right silicon, but... to the right specification,” Keynote Design
Automation Conference 2000, Los Angeles.
[4] T. Burd, T. Pering, A. Stratakos, and R. Brodersen, “A dynamic voltage scaled
microprocessor system,” IEEE ISSCC Dig. Tech. Papers, pp. 294-295, Feb. 2000.
[5] J. Villasenor and W. Mangione-Smith, “Configurable Computing,” Scientific American,
pp. 66-73, June 1997.
[6] A. DeHon, “Reconfigurable architectures for general purpose computing,” Technical
Report 1586, MIT Artificial Intelligence Laboratory, September 1996.
[7] J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, Morgan
Kaufmann Publishers, San Mateo, 1990.
[8] Silicon after 2010, DARPA ISAT study group, August 1997.
[9] Virtex-II Pro Platform FPGAs,
http://www.xilinx.com/xlnx/xil_prodcat_landing_page.jsp?title=Virtex-II+Pro+FPGAs,
Xilinx, Inc.
[10] H. Zhang, V. Prabhu, V. George, M. Wan, M. Benes, A. Abnous, and J. Rabaey, “A 1 V
heterogeneous reconfigurable processor ic for baseband wireless applications,” Proc.
ISSCC, pp. 68-69, February 2000.
[11] S. Hauck et al, “Triptych - an FPGA architecture with integrated logic and routing,”
Proc. 1992 Brown/MIT Conference, pp. 26-32, March 1992.
[12] Yeung and J. Rabaey, “A 2.4 GOPS data-driven reconfigurable multiprocessor IC for
DSP,” Proc. IEEE ISSCC Conference 1995, pp. 108-109, San Francisco, 1995.
[13] H. Zhang, V. George, J. Rabaey, “Low-swing on-chip signaling techniques:
effectiveness and robustness,” IEEE Transactions on VLSI Systems, vol. 8 (no.3), pp.
264-272, June 2000.
Chapter 16
Energy-Efficient System-Level Design

Luca Benini¹ and Giovanni De Micheli²

¹ Università di Bologna; ² Stanford University

Abstract: The complexity of current and future integrated systems requires a paradigm
shift towards component-based design technologies that enable the integration
of large computational cores, memory hierarchies and communication
channels as well as system and application software onto a single chip.
Moving from a set of case studies, we give an overview of energy-efficient
system-level design, emphasizing a component-based approach.

Key words: Embedded systems, memory hierarchy, network-on-chip, chip multiprocessor,
system software, application software, power management.

16.1 INTRODUCTION

A system is a collection of components whose combined operation provides
a useful service. We consider specifically systems on chips (SoCs). Such
systems consist of hardware components integrated on a single chip and
various software layers. Hardware components are macro-cells that provide
information processing, storage, and interfacing. Software components are
programs that realize system and application functions.
When analyzing current SoC designs, it is apparent that systems are
described and realized as collections of components. Indeed, to date, there is
limited use of behavioral synthesis at the system level. System
implementation by component interconnection allows designers to realize
complex functions while leveraging existing units and/or design
technologies, such as synthesis, on components whose size is much smaller
than the system itself.
Sometimes, system specifications are required to fit into specific
interconnections of components called hardware platforms. Thus, a
hardware platform, which is a restriction of the design space, may facilitate
system realization because it reduces the number of design options and
fosters the use and reuse of standard components. Expertise with designing
systems on a known platform is also a decisive factor in reducing design
time and in increasing designers' confidence in success.
System design consists of realizing a desired functionality while
satisfying some design constraints. Broadly speaking, constraints limit the
design space and relate to the major design trade-off between quality of
service (QoS) versus cost. QoS is closely related to performance, i.e., the
number of tasks that can be computed in a time window (system
throughput), as well as the time delay to complete a task (latency). QoS
relates also to the system dependability, i.e., to a class of specific system
figures (e.g., reliability, availability, safety) that measure the ability of the
system to deliver a service correctly, within a given time window and at any
time. Design cost relates to design and manufacturing costs (e.g., silicon
area, testability) as well as to operation costs (e.g., power consumption,
energy consumption per task).
In recent years, the design trade-off of performance versus power
consumption has received large attention because of: (i) the large number of
mobile systems that need to provide services with the energy releasable by a
battery of limited weight and size, (ii) the limits that heat extraction places
on the technical feasibility of high-performance computation, and (iii) concerns
about operating costs caused by electric power consumption in large systems
and the dependability of systems operating at high temperatures because of
power dissipation. Dependability measures will be extremely relevant in the
near future because of the use of SoCs in safety-critical applications (e.g.,
vehicular technologies) and in devices that connect humans with services
(e.g., portable terminals used to manage finances and working activities).
Recent design methodologies and tools have been addressing the problem
of energy-efficient design, aiming at providing a high-performance
realization while reducing its power dissipation. Most of these techniques, as
described in the previous chapters, address system components design. The
objective of this chapter is to describe current techniques that address
system-level design.

16.2 SYSTEMS ON CHIPS AND THEIR DESIGN

We attempt to characterize SOC designs based on trends and technologies.
Electronic systems are best implemented on a single chip because input-
output pins are a scarce resource, and because on-chip interconnect is faster
and more reliable while overall cost is usually smaller. At present, it is
possible to integrate opto-electronic units on chip (e.g., charge-coupled
device cameras) and mechanical elements (e.g., accelerometers) even though
systems with such components go beyond the scope of this chapter. In some
domains, e.g., digital telephony, there is a definite trend to cluster all
electronics of a product on a single die.
Current and near-future electronic technologies provide designers with an
increasingly larger number of transistors per chip. Standard, CMOS silicon-
based technologies with feature size around 100nm are considered here.
Such technologies support half a billion transistor chips of a few square
centimeters in size, according to the international technology semiconductor
roadmap (ITRS). As device sizes will further shrink to 50nm by the end of
the decade, chips will accommodate up to four billion transistors. Whereas
the increased amount of active devices will support increasingly more
complex design, chip power dissipation will be capped around 175W
because of packaging limitations and costs. Thus, the computing potential is
limited by energy efficiency.
At the same time, the design of large (i.e., billion transistor) chips will be
limited by the ability of humans and computer-aided design (CAD) tools to
tame their complexity. The million-transistor chip frontier was overcome by
using semi-custom technologies and cell libraries in the 1990s. Billion-
transistor chips will be designed with methodologies that limit design
options and leverage both libraries of very large scale components and
generators of embedded memory arrays.
Such library components are typically processors, controllers, and
complex functional units (e.g., MPEG macro-cells). System designers will
accept such components as basic building blocks as they are used to
accepting NAND and NOR gates without questioning their layout. At the
same time, successful component providers are expected to design reliable
and flexible units that can interact with others under varying operating
conditions and modes. Post-design, possibly in situ, software (or
programmable hardware) configuration of these components will play a
major role in achieving versatile components.
When observing any SoC layout, it is simple to recognize large memory
arrays. The ability to realize various types of embedded memories on chip
and the interspersion of storage and computing units are key to achieving
high-performance. The layout of embedded memory arrays is automatically
generated by physical synthesis tools and can be tailored in size, aspect ratio,
and speed.
The distinguishing features of the upcoming SoCs relate directly to the
features and opportunities offered by semiconductor technology. Namely,
SoCs will display many processing elements (i.e., cores) and memory arrays.
Multi-processing will be the underlying characteristic of such chips. Thus
SoC technology will provide for both the implementation of multi-
processing computing systems and application-specific functions. The latter
class of systems is likely to be large and will be the driving force for SoC
technology. Indeed, embedded systems will be realized by SoCs realizing a
specific function, e.g., vehicular control, processing for wireless
communication, etc. Application specific SoCs will be characterized by the
presence of processing units running embedded software (and thus emulating
hardware functions) and by asymmetric structures, due to the diversity of
functions realized by the processing elements and their different storage
requirements.
The presence of several, possibly application-specific, on-chip storage
arrays presents both an opportunity and a design challenge. Indeed, the use
of hierarchical storage that exploits spatial and temporal locality by
interspersing processing elements and storage arrays is key to achieving high
throughput with low latency [1][2][3]. The sizing and synthesis of embedded
storage arrays poses new challenges, because the effectiveness of multi-
processing is often limited by the ability to transfer and store information.
SoCs will generate large data traffic on chip; the energy spent to process data
is likely to be dwarfed by the energy spent to move and store data. Thus, the
design of the on-chip communication and storage systems will be key in
determining the energy/performance trade-off points of an implementation.
The use of processing cores will force system designers to treat them as
black boxes and renounce the detailed tuning of their performance/energy
parameters. Nevertheless, many processing elements are designed to operate
at different service levels and energy consumption, e.g., by controlling their
operation frequency and voltage. Thus system designers will be concerned
with run-time power management issues, rather than with processing
element design.
As a result, the challenging issues in system-level design relate to
designing the storage components and the interconnect network of SoCs. At
the same time, designers must conceive run-time environments that manage
processing elements, memory, and on-chip network to provide for the
workload-dependent operating conditions, which yield the desired quality of
service with minimal energy consumption. In other words, SoCs will require
dedicated operating systems that provide for power management.
The overall design of system and application software is crucial for
achieving the overall performance and energy objectives. Indeed, while
software does not consume power per se, the software execution causes
energy consumption by processing elements, storage arrays, and
interconnect network. It is well known that software design for a SoC is at
least as demanding as hardware design. For this reason, software design
issues will be covered in this chapter.
The remainder of this chapter is organized as follows. First, a set of recent
SoC examples is considered to motivate this survey. Next, the design of
on-chip storage arrays and interconnect networks is addressed. The chapter
concludes with a survey of software design techniques for both system and
application software.

16.3 SOC CASE STUDIES

This section analyzes three SoC designs from an energy-centric perspective.
It is organized in order of tightening power and cost constraints, starting
from a 3D graphics engine for game consoles, moving to an MPEG4 encoder-
decoder for 3G wireless terminals, and concluding with an audio recorder for
low-end consumer applications. Clearly, this survey gives a very partial view
of an extremely variegated landscape, but its purpose is to focus on the key
design challenges in power-constrained integrated system design and to
distill system design guidelines that have led to successful industrial
implementations.

16.3.1 Emotion Engine

The Emotion Engine [4][5] was designed by Sony and Toshiba to support 3-
D graphics for the PlayStation 2 game console. From a functional viewpoint,
the design objective was to enable real-time synthesis of realistic animated
scenes in three dimensions. To achieve the desired degree of realism,
physical modeling of objects and their interactions, as well as 3-D geometry,
transformation are required. Power budget constraints are essentially set by
cost considerations: the shelf price of a game console should be lower than
US$ 500, thus ruling out expensive packaging and cooling. Furthermore,
game consoles should be characterized by the low cost of ownership,
robustness with respect to a wide variety of operating conditions, and
minimal maintenance. All of these requirements conflict with high power
dissipation. These challenges were met by following two fundamental design
guidelines: (i) integration of most of the critical communication, storage, and
computation on a single SoC, and (ii) architectural specialization for a
specific class of applications.
The architecture of the Emotion Engine is depicted in Figure 16.1. The
system integrates three independent processing cores and a few smaller I/O
controllers and specialized coprocessors. The main CPU, the master
controller, is a superscalar RISC processor with a floating-point
coprocessor. The other two cores are floating-point vector processing units.
The first vector unit, VPU0, performs physical modeling computations,
while the second, VPU1, is dedicated to 3-D geometry computation. These
two functions are allocated to two different vector units because their
schedules are conflicting. Physical modeling is performed under the control
of the main CPU, and it is scheduled quite irregularly and unpredictably. In
contrast, geometry computations are performed in response to requests from
the rendering engine, which are spaced in equal time increments.

The main CPU is a two-way superscalar RISC core implementing the
MIPS III instruction set, plus 107 new SIMD multimedia instructions. The
core has 32 128-bit registers and two 64-bit integer units. Instruction and
data caches are two-way set associative, 16-KB and 8-KB, with one-cycle
access. Local data storage is also supported by a 16-KB scratch-pad RAM
(one-cycle access). The vector units VPU0 and VPU1 have similar micro-
architectures. However, VPU0 works as a coprocessor of the main CPU,
while VPU1 operates independently. The vector units have a four-way
SIMD organization. Instruction memory is 64-bits wide and its size is 16-KB
(for VPU1, 4-KB for VPU0). To provide single-cycle data feed to the
floating-point units, four pipelined buffers are instantiated within the VPUs.
The quad-buffer appears as a 16-KB (for VPU1, 4-KB for VPU0), 4-ported
memory.
Communication is critical for system performance. VPU1 works
independently from the processor and produces a very large amount of data
for the external rendering engine. Therefore, there is a dedicated connection
and I/O port between VPU1 and the rendering engine. In contrast, VPU0
receives data from the CPU (as a coprocessor). For this reason data transfer
from/to the unit is stored in the CPU's scratch-pad memory and transferred to
VPU0 via DMA on a shared, 128-bit interconnection bus. The bus supports
transfers among the three main processors, the coprocessors, and I/O blocks
(e.g., for interfacing with high-bandwidth RDRAM).
The Emotion Engine was fabricated in technology with
drawn gate length for improved switching speed. The CPU and the VPUs are
clocked at 250MHz. External interfaces are clocked at 125 MHz. Die size is
The chip contains 10.5 million transistors. The chip can
sustain 5 GFLOPs. With power supply the power consumption
is 15 W. Clearly, such a power consumption is not adequate for portable,
battery-operated equipment; however it is much lower than that of a general-
purpose microprocessor with similar FP performance (in the same
technology).
The energy efficiency of the Emotion Engine stems from several factors.
First, it contains many fast SRAM memories, providing adequate bandwidth
for localized data transfers but not at the high energy cost implied by cache
memories. By contrast, instruction and data caches have been kept
small, and it is up to the programmer to develop tight inner loops that
minimize misses. Second, the architecture provides an extremely aggressive
degree of parallelism without pushing the envelope for maximum clock
speed. Privileging parallelism with respect to sheer speed is a well-known
low-power design technique [6]. Third, parallelism is explicit in hardware
and software (the various CPUs have well-defined tasks), and it is not
compromised by centralized hardware structures that impose unacceptable
global communication overhead. The only global communication channel
(the on-chip system bus) is bypassed by dedicated ports for high-bandwidth
point-to-point communication (e.g., between VPU1 and the rendering
hardware). Finally, the SoC contains many specialized coprocessors for
common functions (e.g., MPEG2 video decoding), which unload the
processors and achieve very high energy efficiency and locality.
Specialization is also fruitfully exploited in the micro-architecture of the
programmable processors, which natively support a large number of
application-specific instructions.

16.3.2 MPEG4 Core

In contrast with the Emotion Engine, the MPEG4 video codec SoC described
by Takahashi et al. [7] has been developed specifically for the highly power-
constrained mobile communications market. Baseband processing for a
multimedia-enabled 3G wireless terminal encompasses several complex
tasks that can, in principle, be implemented by multiple ICs. However, it is
hard to combine many chips within the small body of a mobile terminal, and,
more importantly, the high-bandwidth I/O interfaces among the various ICs
would lead to excessive power consumption. For this reason, Takahashi et
al. opted for an SoC solution that integrates most of the digital baseband
functionality. The SoC implements a video codec, a speech codec or an
audio decoder, and multiplexing and de-multiplexing between multiple video
and speech/audio streams.

Video processing is characterized by large data streams from/to memory,
and memory space requirements are significant. For this reason, the MPEG4
video codec has been implemented in an embedded-DRAM process. The
abstracted block diagram of the SoC is shown in Figure 16.2. The chip
contains 16-Mb embedded DRAM and three signal processing cores: a video
core, a speech/audio core, and a stream-multiplexing core. Several peripheral
interfaces (camera, display, audio, and an external CPU host for
configuration) are also implemented on-chip.
Each of the major signal processing cores contains a 16-bit RISC
processor and dedicated hardware accelerators. The system is a three-way
asymmetric on-chip multiprocessor. Data transfers among the three
processors are performed via the DRAM. A virtual FIFO is configured on
the DRAM for each processor pair. The size of the FIFOs can be changed by
the firmware of each core. The communication network is organized as a set
of point-to-point channels between processors and DRAM. An arbitration
unit regulates access to the DRAM, based on DMA. Most of the traffic on
the channels is caused by cache and local memory refills issued by the three
processing cores. Communication among processors is sporadic.
The video processing core of the SoC contains a multimedia-enhanced
RISC processor with a 4-Kb direct-mapped instruction cache and an 8-Kb
data cache. The video processor also includes several custom coprocessors:
2 DCT coprocessors, a motion compensation block, two motion estimation
blocks, and a filter block. All hardware accelerators have local SRAM
buffers for limiting the number of accesses to the shared DRAM. The total
SRAM memory size is 5.3 Kb. The video processing core supports
concurrent execution in real time of one encoding thread and up to four
decoding threads. The audio core has a similar organization. It also contains
a RISC processor with caches, but it includes different coprocessors. The
multiplexing core contains a RISC processor and a network interface block,
and it handles tasks without the need for hardware accelerators.
The MPEG4 core targets battery-powered portable terminals; hence, it
has been optimized for low power consumption at the architectural, circuit,
and technology level. Idle power reduction was a primary concern.
Therefore, clock gating is adopted throughout the chip; the local clock is
automatically stopped whenever processors or hardware accelerators are
idle. Shutdown is also supported at a coarser granularity: all RISC
processors support sleep instructions for explicit, software-controlled
shutdown, with interrupt-based wake-up. Active power minimization is
tackled primarily through the introduction of embedded DRAM, which
drastically reduces IO, bus, and memory access energy. Memory tailoring
reduces power by 20% with respect to a commodity-DRAM solution. Page
and word size have been chosen to minimize redundant data fetch and
transfer, and specialized access modes have been defined to improve latency
and throughput.
To further reduce power, the SoC was designed in a variable-threshold
CMOS technology. In active mode, the threshold voltage of the transistors is
0.55 V. In standby mode it is raised through body-bias to 0.65 V to reduce
leakage. The chip contains 20.5 million transistors; the 16-Mb embedded DRAM occupies
roughly 40% of the chip. The chip consumes 260 mW at 60 MHz. Compared
to a previous design, with external commodity DRAM and separate video
and audio processing chips, power is reduced by roughly a factor of four.
Comparing the MPEG4 core with the Emotion Engine from a power
viewpoint, one notices that the MPEG4 core consumes roughly 60 times less
power than the Emotion Engine at a comparable integration level. The differences in
speed and voltage supply account for a difference in power consumption of,
roughly, a factor of 2, which becomes a factor of 4 if one discounts area (i.e.,
focuses on power density). The residual 15 times difference is due to the
different transistor usage (the MPEG4 core is dominated by embedded
DRAM, which has low power density), and to architecture, circuit, and
technology optimizations. This straightforward comparison convincingly
demonstrates the impact of power-aware system design techniques and the
impressive flexibility of CMOS technology.

16.3.3 Single-chip Voice Recorder

Digital audio is a large market where system cost constraints are extremely
tight. For this reason, several companies are actively pursuing single-chip
solutions based on embedded memory for the on-chip storage of sound
samples [8] [9]. The main challenges are the cost per unit area of
semiconductor memory, and the power dissipation of the chip, which should
be as low as possible to reduce the cost of batteries (e.g., primary Lithium
vs. rechargeable Li-Ion).
The single-chip voice recorder and player developed by Borgatti and
coauthors [10] stores recorded audio samples on embedded FLASH
memory. The chip was originally implemented in technology with
3.0 V supply, and it is a typical example of an SoC designed for a single
application. The main building blocks (Figure 16.3) are: a microcontroller
unit (MCU), a speech coder and decoder, and an embedded FLASH
memory. A distinguishing feature of the system is the use of a multi-level
storage scheme to increase the speech recording capacity of the FLASH.
Speech samples are first digitized then compressed with a simple waveform
coding technique (adaptive-differential pulse-code modulation) and finally
stored in FLASH memory, 4 bits per cell.
A density of 4 bits per cell requires 16 different thresholds for the FLASH
cells. Accurate threshold programming and readout requires mixed-signal
circuitry in the memory write and read paths. The embedded FLASH macro
contains 8 Mcells. It is divided into 128 sectors that can be independently
erased. Each sector contains 64-K cells, which can store 32 Kbytes in
multilevel mode. Memory read is performed through an 8-bit, two-step
analog-to-digital converter.
Besides the multilevel FLASH memory, the other main components of
the SoC are the 8-bit MCU, the ADPCM speech codec, and the 16-bit on-chip bus. The core interfaces to two 32-kB embedded RAM blocks (one for
storing data and the other for executable code and data). The two blocks are
split into 16 selectively accessed RAM modules to reduce power
consumption. The executable code is downloaded to program RAM from
dedicated sectors of the FLASH macro through 16-bit DMA transfers on the
on-chip bus. A few code blocks (startup code, download control code, and
other low-level functions) are stored in a small ROM module (4-kB).
The speech codec is a custom datapath block implementing ITU-T G.726
compression (ADPCM). Its input/output ports are in PCM format for
directly interfacing to a microphone and a loudspeaker. At a clock speed of
128 kHz, a telephone-quality speech signal can be compressed at one of four
selectable bit rates (16-40 kbit/s). The compressed audio stream is packed in
blocks of 1 kB using two on-chip RAM buffers (in a two-phase fashion).
This organization guarantees that samples can be transferred to FLASH in
blocks, at a much higher burst rate than the sample rate.
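The two-phase buffering just described is a classic ping-pong scheme: one RAM buffer fills with samples while the other is drained to FLASH in a single burst. The sketch below is a behavioral model under assumed parameters (the block size and the flush callback are illustrative, not the chip's actual firmware interface).

```python
class PingPongBuffer:
    """Two RAM buffers used alternately: while one fills with ADPCM
    samples, the full one is drained to FLASH as a single burst."""
    def __init__(self, block_size, flush):
        self.block_size = block_size
        self.flush = flush                  # burst-transfer callback
        self.buffers = [[], []]
        self.active = 0                     # buffer currently being filled

    def write_sample(self, sample):
        buf = self.buffers[self.active]
        buf.append(sample)
        if len(buf) == self.block_size:     # block complete: swap roles
            self.active ^= 1                # new samples go to the other buffer
            self.flush(buf)                 # drain the full buffer in one burst
            buf.clear()

bursts = []
pp = PingPongBuffer(block_size=4, flush=lambda b: bursts.append(list(b)))
for s in range(10):
    pp.write_sample(s)
assert bursts == [[0, 1, 2, 3], [4, 5, 6, 7]]
```

Decoupling the slow, steady sample rate from fast block transfers is what lets the FLASH interface operate in its efficient burst mode.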
The on-chip bus is synchronous and 16-bits wide, and it supports
multiple masters and interrupts. A bus arbiter manages mutual exclusion and
resolves access conflicts. A static priority order is assigned to all bus masters
at initialization time, but it can be modified through a set of dedicated
signals. The on-chip bus can be clocked at different speeds (configured
through a software-accessible register). Each block is driven by its own
dedicated clock and may run at a different speed. All clocks are obtained by dividing an externally
provided 16-32 MHz clock. Clock gating was used extensively to reduce the
power consumption of idle sub-circuits.
The chip is fabricated in a common-ground NOR embedded
FLASH process, and the chip has only 26 logically
active pins. Standby power is less than 1 mW. Peak power during recording
is 150 mW and 110 mW during play. The average power increases with
higher bit rates, but it is generally much smaller than peak power (e.g.,
75 mW for recording at 24 kbit/s).
The single-chip recorder demonstrates power minimization principles
that have not been fully exploited in the SoCs examined in the previous
subsections. The use of application-specific processing units is pushed one
step further. Here, the programmable processor has only control and
coordination functions. All computationally expensive data processing is
farmed off to a specialized datapath block. An additional quantum leap in
energy efficiency is provided by mixed-signal or analog implementation of
key functional blocks. In this chip, analog circuits are used to support 4-bit-
per-cell programming density in the embedded FLASH memory. The four-fold
density increase for embedded memory represents a winning point from the
energy viewpoint as well.

16.4 DESIGN OF MEMORY SYSTEMS

The SoCs analyzed in the previous section demonstrate that today's
integrated systems contain a significant amount of storage arrays. In many
cases the fraction of silicon real estate devoted to memory is dominant, and
the power spent in accessing memories dictates the overall chip power
consumption. The general trend in SoC integration is toward increasing
embedded memory content [11]. It is reported that, on average, 50% of the
transistors in an SoC designed in 2001 are instantiated within memory
arrays. This percentage is expected to grow to 70% by 2003 [12]. In view of
this trend it is obvious that energy-efficient memory system design is a
critical issue.
The simplest memory organization, the flat memory, assumes that data is
stored in a single, large array. Even in such a simplistic setting, sizing
memory arrays is not trivial. Undersized memories penalize system
performance, while oversized memories cost in terms of silicon area as well
as performance and power, because access time and power increase
monotonically with memory size [13][14].
The most obvious way to alleviate memory bottlenecks is to reduce the
storage requirements of the target application. To this goal, designers can
reduce memory requirements by exploiting the principle of temporal
locality, i.e., trying to reuse the results of a computation as soon as possible,
in order to reduce the need for temporary storage. Other memory-reduction
techniques aim at finding efficient data representations that reduce the
amount of unused information stored in memory. Storage reduction
techniques cannot completely remove memory bottlenecks, mainly because
they try to optimize power and performance indirectly as a by-product of the
reduction of memory size. As a matter of fact, memory size requirements of
system applications have steadily increased over time.
From the hardware design viewpoint, memory power reduction has been
pursued mainly through technology and circuit design and through a number
of architectural optimizations. While technology and circuit techniques are
reviewed in detail in previous chapters, architectural optimizations, which
rely on the idea of overcoming the scalability limitation intrinsic to flat
memories, are focused on here. Indeed, hierarchical memories allow the
designer to exploit the spatial locality of reference by clustering related
information into the same (or adjacent) arrays.

16.4.1 On-chip Memory Hierarchy

The concept of a memory hierarchy, conceptually depicted in Figure 16.4, is
at the basis of most on-chip memory optimization approaches. Lower levels
in the hierarchy are made of small memories, tightly coupled with
processing units. Higher hierarchy levels are made of increasingly larger
memories, placed relatively far from computation units, and possibly shared.
When looking at the hierarchical structure of computational and storage
nodes, the distance between a computation unit and a storage array
represents the effort needed to fetch (or store) a data unit from (to) the
memory. The main objective of energy-efficient memory design is to
minimize the overall energy cost for accessing memory within performance
and memory size constraints. Hierarchical organizations reduce memory
power by exploiting non-uniformity (or locality) in access.
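A back-of-the-envelope model makes the locality argument concrete: the average energy per access is the first-level cost plus the miss traffic times the cost of the next level, and so on outward. The sketch below is illustrative only; the per-access energies and miss rates are invented numbers, not measurements.

```python
def avg_access_energy(levels):
    """levels: list of (energy_per_access, miss_rate) pairs, ordered
    from the level closest to the processor outward; the last level
    is assumed to always hit (miss_rate 0)."""
    energy, reach = 0.0, 1.0   # `reach`: fraction of accesses arriving here
    for e, miss in levels:
        energy += reach * e
        reach *= miss          # only misses propagate to the next level
    return energy

# Hypothetical figures: small SRAM 0.1 nJ/access, larger on-chip
# array 1 nJ, far-away off-chip DRAM 10 nJ.
hierarchical = [(0.1, 0.05), (1.0, 0.2), (10.0, 0.0)]
flat = [(10.0, 0.0)]           # one large, distant memory
assert avg_access_energy(hierarchical) < avg_access_energy(flat)
```

With these numbers the hierarchy averages 0.25 nJ per access against 10 nJ for the flat memory, which is the whole economic case for exploiting locality.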

Memory optimization techniques can be classified into three categories:

Memory hierarchy design. Given a dynamic trace of memory accesses, obtained by profiling an application, derive a
customized memory hierarchy.
Computation transformation. Given a fixed memory hierarchy,
modify the storage requirements and access patterns of the target
computation to optimally match the given hierarchy.
Synergistic memory and computation optimization. Concurrently
optimize memory access patterns and memory architecture.

Memory-hierarchy design is considered next. Computation
transformations are software-oriented techniques (see Section 5). For a
comprehensive survey of the topic, with special emphasis on synergistic
techniques, refer to [6][15].
When comparing time and energy per access in a memory hierarchy, one
can observe that they both increase with the move from low to high
hierarchy levels. One may be led to conclude that a low-latency memory
architecture will also be a low-power architecture and that memory
performance optimization implies power optimization. This conclusion is
often incorrect for three main reasons. First, even though both power and
performance increase with memory size and memory hierarchy levels, they
do not increase by the same amount. Second, performance is a worst-case
quantity (i.e., intensive), while power is an average-case quantity (i.e.,
extensive). Thus, memory performance can be improved by removing a
memory bottleneck on a critical computation, but this may be harmful for
power consumption; the impact of a new memory architecture on all
memory accesses, not only the critical ones, needs to be considered. Third,
several circuit-level techniques actually trade shorter access time for higher
power (and vice versa) at a constant memory size. The following example,
taken from [16], demonstrates how energy and performance can be
contrasting quantities.

Example 16.1 The memory organization options for a two-level memory
hierarchy (on-chip cache and off-chip main memory) explored in [16] are
the following: (i) cache size, ranging from 16 bytes to 8KB (in powers of
two); (ii) cache line size, from 4 to 32 bytes, in powers of two; (iii) associativity (1,
2, 4, and 8); and (iv) off-chip memory size, from 2-Mbit to 16-Mbit SRAM.
The exhaustive exploration of the cache organization for minimum
energy for an MPEG decoding application results in an energy-optimal
cache organization with cache size 64 bytes, line size 4 bytes, 8-way set
associative. Notice that this is a very small memory, almost fully associative
(only two lines). For this organization, the execution time is 142,000 cycles. In contrast, exploration for
maximum performance yields a cache size of 512 bytes, a line size of 16
bytes, and is 8-way set associative. Notice that this cache is substantially
larger than the energy-optimal one. In this case, the execution time is
reduced to 121,000 cycles, but the memory energy increases.
One observes that the second cache dominates the first one for size, line
size, and associativity; hence, it has the larger hit rate. This is consistent
with the fact that performance strongly depends on miss rate. On the other
hand, if external memory access power is not too large with respect to cache
access (as in this case), some hit rate can be traded for decreased cache
energy. This justifies the fact that a small cache with a large miss rate is
more power-efficient than a large cache with a smaller miss rate.
The example shows that energy cannot generally be reduced as a
byproduct of performance optimization. On the other hand, architectural
solutions originally devised for performance optimization are often
beneficial in terms of energy. Generally, when locality of access is
improved, both performance and energy tend to improve. This fact is heavily
exploited in software optimization techniques.

16.4.2 Explorative Techniques

Several recently proposed memory optimization techniques are explorative.
They exploit the fact that the memory design space can usually be
parameterized and discretized, to allow for an exhaustive or near-exhaustive
search. Most approaches assume a memory hierarchy with one or more
levels of caching and, in some cases, an off-chip memory. A finite number
of cache sizes and cache organization options are considered (e.g., degree of
associativity, line size, cache replacement policy, as well as different off-chip memory alternatives, such as the number of ports and the available memory cuts). The
best memory organization is obtained by simulating the workload for all
possible alternative architectures. The various approaches mainly differ in
the number of hierarchy levels that are covered by the exploration or the
number of available dimensions in the design space. Su and Despain [17],
Kamble and Ghose [18], Ko and Balsara [19], Bahar et al. [20], and Shiue
and Chakrabarti [16] focus on cache memories. Zyuban and Kogge [21]
study register files; Coumeri and Thomas [22] analyze embedded SRAMs;
Juan et al. [23] study translation look-aside buffers.
Example 16.1 has shown an instance of a typical design space and the
result of the relative exploration. An advantage of explorative techniques is
that they allow for concurrent evaluation of multiple cost functions such as
performance and area. The main limitation of the explorative approach is
that it requires extensive data collection, which provides a posteriori insight.
In order to limit the number of simulations, only a relatively small set of
architectures can be tested and compared.
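The explorative flow amounts to enumerating a discretized design space and evaluating each point; a real flow would simulate a workload trace at every point, whereas the sketch below substitutes a fabricated analytical miss-rate and cost model purely to show the search structure. All model constants are assumptions.

```python
from itertools import product

def miss_rate(size, assoc):
    # Toy analytical miss model: bigger, more associative caches miss
    # less. Purely illustrative; a real flow simulates an access trace.
    return 1.0 / (size * assoc) ** 0.5

def evaluate(size, line, assoc, e_hit=0.5, e_miss=20.0):
    m = miss_rate(size, assoc)
    energy = e_hit * (size / 64.0) ** 0.5 + m * e_miss  # larger cache: costlier hits
    time = 1.0 + m * 50.0                               # average cycles per access
    return energy, time

space = list(product([64, 256, 1024, 8192],  # cache size (bytes)
                     [4, 8, 16, 32],         # line size (bytes)
                     [1, 2, 4, 8]))          # associativity
best_energy = min(space, key=lambda p: evaluate(*p)[0])
best_time = min(space, key=lambda p: evaluate(*p)[1])
# As in Example 16.1, the energy-optimal cache is much smaller than
# the performance-optimal one.
assert best_energy[0] < best_time[0]
```

Even this toy model reproduces the qualitative outcome of Example 16.1: the two objectives select different points, so neither search can substitute for the other.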

16.4.3 Memory Partitioning

Within a hierarchy level, power can be reduced by memory partitioning.
The principle of memory partitioning is to sub-divide the address space and
to map blocks to different physical memory banks that can be independently
enabled and disabled. Arbitrarily fine partitioning is prevented by the fact
that a large number of small banks is area-inefficient and imposes a severe
wiring overhead, which tends to increase communication power and degrade
performance.
Partitioning techniques can be applied at all hierarchy levels, from
register files to off-chip memories. Another aspect is the “type” of
partitioning, such as physical or logic partitioning. Physical partitioning
strictly maps the address space onto different, non-overlapping memory
blocks. Logic partitioning exploits some redundancy in the various blocks of
the partition, with the possibility of addresses that are stored several times in
the same level of hierarchy.
A physically-partitioned memory is energy-efficient mainly for two
reasons. First, if accesses have high spatial and/or temporal locality,
individual memory banks are accessed in bursts. Burst access to a single
bank is desirable because idle times for all other banks are long, thereby
amortizing the cost of shutdown [24]. Second, energy is saved because every
access is on a small bank as opposed to a single large memory [17]. For
embedded systems designed with a single application target, application
profiling can be exploited to derive a tailored memory partition, where small
memory banks are tightly fitted on highly-accessed address ranges, while
“colder” regions of the address space can be mapped onto large banks.
Clearly, such a non-uniform memory partitioning strategy can outperform
equi-partition when access profiles are highly non-uniform and are known at
design time [11].
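The benefit of a tailored, non-uniform partition can be sketched with a toy energy model. The square-root growth of per-access energy with bank size and all profile numbers below are assumptions made for illustration; the point is only that serving hot addresses from a small bank pays off when the profile is skewed.

```python
def access_energy(bank_size):
    # Toy model: per-access energy grows with bank size (longer bit
    # lines and word lines); the square-root law is an assumption.
    return bank_size ** 0.5

def total_energy(partition):
    # partition: list of banks; each bank is a list of
    # (region_size_bytes, access_count) address regions mapped to it.
    total = 0.0
    for bank in partition:
        size = sum(s for s, _ in bank)
        accesses = sum(a for _, a in bank)
        total += accesses * access_energy(size)
    return total

# Hypothetical profile: a hot 1-KB region plus 31 KB of cold storage.
hot, cold = (1024, 100000), (31744, 100)
flat = [[hot, cold]]            # one monolithic 32-KB array
tailored = [[hot], [cold]]      # small hot bank + large cold bank
assert total_energy(tailored) < total_energy(flat)
```

With a uniform access profile the two partitions would cost nearly the same, which is why the technique requires the profile to be both skewed and known at design time.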
Logic partitioning was proposed by Gonzalez et al. [25], where the on-chip cache is split into a spatial and a temporal cache to store data with
high spatial and temporal correlation, respectively. This approach relies on a
dynamic prediction mechanism that can be realized without modification to
the application code by means of a prediction buffer.
A similar idea is proposed by Milutinovic et al. [26], where a split
spatial/temporal cache with different line sizes is used. Grun et al. [27]
exploit this idea in the context of embedded systems for energy optimization.
Data are statically mapped to either cache, using the high predictability
of the access profiles for embedded applications, and thus avoiding the
hardware overhead of the buffer. Depending on the application, data might
be duplicated and thus be mapped to both caches. Another class of logic
partitioning techniques falls within the generic scheme of Figure 16.5.
Buffers are placed alongside the I-cache and/or the D-cache to realize some form of
cache parallelization. Such schemes can be regarded as a partitioning
solution because the buffers and the caches are actually part of the same
level of hierarchy.

16.4.4 Extending the Memory Hierarchy

Memory partitioning extends the “width” of the memory hierarchy by
splitting, with or without replication, a given hierarchy level. An alternative
possibility is offered by modifying its “depth,” i.e., the number of hierarchy
levels. This option does not just imply the straightforward addition of extra
levels of caching.
A first class of techniques is based on the insertion of “ad-hoc” memories
between existing hierarchy levels. This approach is particularly useful for
instruction memory, where access locality is very high. Pre-decoded
instruction buffers [28] store instructions in critical loops in a pre-decoded
fashion, thereby decreasing both fetch and decode energy. Loop caches [29]
store the most frequently executed instructions (typically contained in small
loops) and can bypass even the first-level cache. Notice that these additional
memories would not be useful for performance if the first-level cache can be
accessed in a single cycle. On the contrary, performance can be slightly
worsened because the access time for the loop cache is on the critical path of
the memory system.
Another approach is based on the replacement of one or more levels of
caches with more energy-efficient memory structures. Such structures are
usually called scratch-pad buffers and are used to store a portion of the off-
chip memory, in an explicit fashion. In contrast with caches, reads and writes
to the scratch-pad memory are controlled explicitly by the programmer.
Clearly, allocation of data to the scratch pad should be driven by profiling
and statistics collection. These techniques are particularly effective in
application-specific systems, which run an application mix whose memory
profiles can be studied a priori, thus providing intuitive candidates for the
addresses to be put into the buffer. The work by Panda et al. [30][31] is
probably the most comprehensive effort in this area.
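Profile-driven scratch-pad allocation can be sketched as a simple greedy packing: blocks with the most accesses per byte are the best candidates, since moving them saves the most energy per scratch-pad byte consumed. The heuristic, block names, and access counts below are all illustrative assumptions, not the algorithm of the cited work.

```python
def allocate_scratchpad(blocks, capacity):
    """Greedy allocation: place the data blocks with the highest
    accesses-per-byte density into the scratch pad first.
    blocks: (name, size_bytes, access_count) triples from profiling."""
    chosen, used = [], 0
    for name, size, accesses in sorted(blocks,
                                       key=lambda b: b[2] / b[1],
                                       reverse=True):
        if used + size <= capacity:
            chosen.append(name)
            used += size
    return chosen

# Profile-derived candidates; names and counts are hypothetical.
profile = [("coeffs", 256, 90000),   # small and very hot
           ("lut", 1024, 40000),
           ("frame", 4096, 5000)]    # large and relatively cold
assert allocate_scratchpad(profile, capacity=1536) == ["coeffs", "lut"]
```

Unlike a cache, nothing here happens at run time: the mapping is fixed before deployment, which is why the approach suits application-specific systems with stable profiles.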

16.4.5 Bandwidth Optimization

When the memory architecture is hierarchical, memory transfers become a
critical facet of memory optimization. From a performance viewpoint, both
memory latency and bandwidth are critical design metrics [32]. From an
energy viewpoint, memory bandwidth is much more critical than latency.
Optimizing memory bandwidth implies reducing the average number of bits
that are transferred across the boundary between two hierarchy levels in a
time unit. It has been pointed out [33] that memory bandwidth is becoming
more and more important as a metric for modern systems, because of the
increased instruction-level parallelism generated by superscalar or VLIW
processors and because of the density of integration that allows shorter
latencies. Unlike latency, bandwidth is an average-case quantity. Well-
known latency-reduction techniques, such as prefetching, are inefficient in
terms of bandwidth (and energy).
As an example of bandwidth optimization, the work by Burger et al.
[34][33] introduces several variants of traffic-efficient caches that reduce
unnecessary memory traffic by the clever choice of associativity, block size, or replacement policy, as well as clever fetch strategies. These
solutions do not necessarily improve worst-case latency but result in reduced
reads and writes across different memory hierarchy levels, thus reducing
energy as well.
Another important class of bandwidth optimization techniques is based
on the compression of the information passed between hierarchy levels.
These techniques aim at reducing the large amount of redundancy in
instruction streams by storing compressed instructions in the main memory
and decompressing them on the fly before execution. Compression finds
widespread application in wireless networking, where channel bandwidth is
severely limited. In memory compression, the constraints on the speed and
hardware complexity of the compressor and decompressor are much tighter
than in macroscopic networks. Furthermore, memory transfers usually have
very fine granularity (they rarely exceed a few tens of bytes). Therefore, the
achieved compression ratios are usually quite low, but compression speed is
very high. Hardware-assisted compression has been applied mainly to
instruction memory [35][36][37][38] and, more recently, to data memory
[39]. A comprehensive survey of memory compression techniques can be
found in [40].
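The claim that fine-granularity transfers achieve low compression ratios is easy to demonstrate with a general-purpose compressor standing in for the memory-bus hardware. The experiment below uses Python's standard `zlib` on a synthetic, highly redundant byte stream; the data and block sizes are fabricated, and a real memory compressor would use far simpler, faster codes.

```python
import zlib

def compressed_fraction(data, block_size):
    # Compress `data` in independent blocks, as a memory-bus
    # compressor must (each transfer is decoded on its own), and
    # return total compressed size / original size.
    chunks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    return sum(len(zlib.compress(c)) for c in chunks) / len(data)

# Synthetic, highly redundant stand-in for an instruction stream.
stream = bytes(i % 16 for i in range(4096))
# Fine-granularity transfers compress far worse than one large block:
# per-block headers and the lack of cross-block history dominate.
assert compressed_fraction(stream, 32) > compressed_fraction(stream, 4096)
```

The per-block overhead and the loss of shared history across transfers are exactly the effects that cap achievable ratios when transfers rarely exceed a few tens of bytes.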

16.5 DESIGN OF INTERCONNECT NETWORKS

As technology improves and device sizes scale down, the energy spent on
processing and storage components decreases. On the other hand, the energy
for global communication does not scale down. On the contrary, projections
based on current delay optimization techniques for global wires [41] show
that global communication on chip will require increasingly higher energy
consumption.
The chip interconnect has to be considered and designed as an on-chip
network, called a micro-network [42]. As for general network design, a
layered abstraction of the micro-network (shown in Figure 16.6) can help us
analyze the design problems and find energy-efficient communication
solutions. Next, micro-network layers are considered in a bottom-up fashion.
First, the problems due to the physical propagation of signals on chip are
analyzed. Then general issues related to network architectures and control
protocols are considered. Protocols are considered independently from their
implementation, from the physical to the transport layers. The discussion of
higher-level layers is postponed until Section 5. Last, we close this section
by considering techniques for energy-efficient communication on micro-
networks.

16.5.1 Signal Transmission on Chip

Global wires are the physical implementation of on-chip communication
channels. Physical-layer signaling techniques for lossy transmission lines
have been studied for a long time by high-speed board designers and
microwave engineers [43][44].
Traditional rail-to-rail voltage signaling with capacitive termination, as
used today for on-chip communication, is definitely not well-suited for high-
speed, low-energy communication on future global interconnects [44].
Reduced swing, current-mode transmission, as used in some processor-
memory systems, can significantly reduce communication power dissipation
while preserving speed of data communication.
Nevertheless, as technology trends lead us to use smaller voltage swings
and capacitances, error probabilities will rise. Thus the trend toward faster
and lower-power communication may decrease reliability as an unfortunate
side effect. Reliability bounds can be derived from theoretical (entropic)
considerations [45] and measured by experiments on real circuits as voltages
scale.
A paradigm shift is needed to address the aforementioned challenges.
Current design styles consider wiring-related effects as undesirable parasitics
and try to reduce or cancel them by specific and detailed physical design
techniques. It is important to realize that a well-balanced approach should
not over-design wires so that their behavior approaches an ideal one because
the corresponding cost in performance, energy efficiency, and modularity
may be too high. Physical-layer design should find a compromise between
competing quality metrics and provide a clean and complete abstraction of
channel characteristics to micro-network layers above.

16.5.2 Network Architectures and Control Protocols

Due to the limitations at the physical level and to the high bandwidth
requirement, it is likely that SoC design will use network architectures
similar to those used for multi-processors. Whereas shared medium (e.g.,
bus-based) communication dominates today's chip designs, scalability
reasons make it reasonable to believe that more general network topologies
will be used in the future. In this perspective, micro-network design entails
the specification of network architectures and control protocols [46]. The
architecture specifies the topology and physical organization of the
interconnection network, while the protocols specify how to use network
resources during system operation.
The data-link layer abstracts the physical layer as an unreliable digital
link, where the probability of bit errors is non null (and increasing as
technology scales down). Furthermore, reliability can be traded for energy
[45][47]. The main purpose of data-link protocols is to increase the
reliability of the link up to a minimum required level, under the assumption
that the physical layer by itself is not sufficiently reliable.
An additional source of errors is contention in shared-medium networks.
Contention resolution is fundamentally a non-deterministic process because
it requires synchronization of a distributed system, and for this reason it can
be seen as an additional noise source. In general, non-determinism can be
virtually eliminated at the price of some performance penalty. For instance,
centralized bus arbitration in a synchronous bus eliminates contention-
induced errors, at the price of a substantial performance penalty caused by
the slow bus clock and by bus request/release cycles.
Future high-performance shared-medium on-chip micro-networks may
evolve in the same direction as high-speed local area networks, where
contention for a shared communication channel can cause errors, because
two or more transmitters are allowed to send data on a shared medium
concurrently. In this case, provisions must be made for dealing with
contention-induced errors.
An effective way to deal with errors in communication is to packetize
data. If data is sent on an unreliable channel in packets, error containment
and recovery is easier because the effect of the errors is contained by packet
boundaries, and error recovery can be carried out on a packet-by-packet
basis. At the data-link layer, error correction can be achieved by using
standard error-correcting codes (ECC) that add redundancy to the
transferred information. Error correction can be complemented by several
packet-based error detection and recovery protocols. Several parameters in
these protocols (e.g., packet size, number of outstanding packets, etc.) can be
adjusted depending on the goal to achieve maximum performance at a
specified residual error probability and/or within given energy consumption
bounds. At the relatively low noise levels typical of on-chip communication,
recent research results [47] indicate that error recovery is more energy-
efficient than forward error correction, but it increases the variance in
communication latency.
At the network layer, packetized data transmission can be customized by
choosing switching or routing algorithms. The former (e.g., circuit, packet,
and cut-through switching) establishes the type of connection, while the
latter determines the path followed by a message through the network to its
final destination. Switching and routing for on-chip micro-networks affect
the performance and energy consumption heavily. Future approaches will
most likely emphasize speed and the decentralization of routing decisions
[48]. Robustness and fault tolerance will also be highly desirable.
At the transport layer, algorithms deal with the decomposition of
messages into packets at the source and their assembly at the destination.
Packetization granularity is a critical design decision, because the behavior
of most network control algorithms is very sensitive to packet size. Packet
size can be application-specific in SoCs, as opposed to general networks. In
general, flow control and negotiation can be based on either deterministic or
statistical procedures. Deterministic approaches ensure that traffic meets
specifications and provide hard bounds on delays or message losses. The
main disadvantage of deterministic techniques is that they are based on worst
cases, and they generally lead to significant under-utilization of network
resources. Statistical techniques are more efficient in terms of utilization, but
they cannot provide worst-case guarantees. Similarly, from an energy
viewpoint, deterministic schemes are expected to be less efficient than
statistical schemes because of their implicit worst-case assumptions.
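The gap between deterministic and statistical provisioning can be made concrete with a toy capacity calculation: a deterministic guarantee must cover all flows peaking simultaneously, while statistical multiplexing provisions for the mean plus a safety margin. Every number and the crude variance model below are illustrative assumptions.

```python
import math

def deterministic_capacity(flows):
    # Hard guarantees: provision for all flows peaking at once.
    return sum(peak for peak, _ in flows)

def statistical_capacity(flows, k=3.0):
    # Statistical multiplexing: mean load plus k standard deviations.
    # The per-flow variance model is a crude two-point assumption;
    # k trades residual loss probability against capacity.
    mean = sum(avg for _, avg in flows)
    var = sum((peak - avg) ** 2 / 4.0 for peak, avg in flows)
    return mean + k * math.sqrt(var)

# Ten bursty flows, each peaking at 100 Mb/s but averaging 10 Mb/s
# (hypothetical traffic figures).
flows = [(100.0, 10.0)] * 10
assert statistical_capacity(flows) < deterministic_capacity(flows)
```

With these figures the deterministic design needs 1000 Mb/s of capacity against roughly half that for the statistical one, which is the under-utilization (and energy) penalty of worst-case provisioning.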

16.5.3 Energy-efficient Design: Techniques and Examples

This section delves into a few specific instances of energy-efficient micro-network design problems. In most cases, specific solutions that have been
proposed in the literature are also outlined, although it should be clear that
many design issues are open and significant progress in this area is expected
in the near future.

16.5.3.1 Physical Layer

At the physical layer, low-swing signaling is actively investigated to reduce
communication energy on global interconnects [49]. In the case of a simple
CMOS driver, low-swing signaling is achieved by lowering the driver's
supply voltage V_dd. This implies a quadratic dynamic-power reduction
(because P_dyn is proportional to C*V_dd^2*f). Unfortunately, swing reduction at the transmitter
complicates the receiver's design. Increased sensitivity and noise immunity
are required to guarantee reliable data reception. Differential receivers have
superior sensitivity and robustness, but they require doubling the bus width.
To reduce the overhead, pseudo-differential schemes have been proposed,
where a reference signal is shared among several bus lines and receivers, and
incoming data is compared against the reference in each receiver. Pseudo-
differential signaling reduces the number of signal transitions, but it has
reduced noise margins with respect to fully-differential signaling. Thus,
reduced switching activity is counterbalanced by higher swings, and
determining the minimum-energy solution requires careful circuit-level
analysis.
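The quadratic dependence is easy to quantify. The sketch below uses assumed, illustrative values for the wire capacitance and voltage swings; it is not taken from any specific design:

```python
def dynamic_energy(c_load, v_swing):
    """Energy drawn per 0->1 transition when the driver supply equals the
    signal swing: E = C * V^2 (charge C*V delivered from a supply at V)."""
    return c_load * v_swing ** 2

# assumed values: a 5 pF global wire, 1.2 V full swing vs. 0.4 V low swing
full_swing = dynamic_energy(5e-12, 1.2)
low_swing = dynamic_energy(5e-12, 0.4)
print(full_swing / low_swing)  # quadratic benefit: 9.0x
```

Tripling down the swing yields a nine-fold energy reduction, which is why low-swing signaling is attractive despite the receiver-side complications discussed above.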
Another key physical-layer issue is synchronization. Traditional on-chip
communication has been based on the synchronous assumption, which
implies the presence of global synchronization signals (i.e., clocks) that
define data sampling instants throughout the chip. Unfortunately, clocks are
extremely energy-inefficient, and it is a well-known fact that they are
responsible for a significant fraction of the power budget in digital integrated
systems. Thus, postulating global synchronization when designing on-chip
micro-networks is not an optimal choice from the energy viewpoint.
Alternative on-chip synchronization protocols that do not require the
presence of a global clock have been proposed in the past [50][51] but their
effectiveness has not been studied in detail from the energy viewpoint.

16.5.3.2 Data-link Layer

At the data-link layer, a key challenge is to achieve the specified
communication reliability level with minimum energy expense. Several error
recovery mechanisms developed for macroscopic networks can be deployed
in on-chip micro-networks, but their energy efficiency should be carefully
assessed in this context. As a practical example, consider two alternative
reliability-enhancement techniques: error-correcting codes and error-
detecting codes with retransmission. Both approaches are based on
transmitting redundant information over the data link, but error-correction is
generally more demanding than error detection in terms of redundancy and
decoding complexity. Hence, we can expect error-correcting transmission to
496 Energy-Efficient System-Level Design

be more power-hungry in the error-free case. However, when an error arises,
error-detecting schemes require retransmission of the corrupted data.
Depending on the network architecture, retransmission can be very costly in
terms of energy (and performance).
Clearly, the trade-off between the increased cost of error correction and
the energy penalty of retransmission should be carefully explored when
designing energy-efficient micro-networks [45]. Either scheme may be
optimal, depending on system constraints and on physical channel
characteristics. Automatic design space exploration could be very beneficial
in this area.
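A first-order model makes the trade-off concrete. In the sketch below, all energy numbers and the word-error probability are assumptions chosen only to illustrate the comparison, not measurements from any system:

```python
def expected_energy_per_word(e_tx, e_dec, p_word_err, retransmit):
    """Expected energy to deliver one word over an unreliable link.
    retransmit=True : error detection + retransmission (geometric retries)
    retransmit=False: forward error correction (single, costlier decode)."""
    if retransmit:
        attempts = 1.0 / (1.0 - p_word_err)  # mean number of transmissions
        return attempts * (e_tx + e_dec)
    return e_tx + e_dec

# assumed energies in pJ, assumed word-error probability
edc = expected_energy_per_word(e_tx=10.0, e_dec=1.0, p_word_err=1e-3,
                               retransmit=True)
fec = expected_energy_per_word(e_tx=10.0, e_dec=4.0, p_word_err=1e-3,
                               retransmit=False)
print(edc < fec)  # True: with rare errors, detection + retransmission wins here
```

With frequent errors (or a large retransmission cost on a given network architecture), the inequality can flip, which is exactly the design-space exploration argued for above.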
Bertozzi et al. [47] considered error-resilient codes for 32-bit buses.
Namely, they consider Hamming encoding/decoding schemes that support
single-error correction, double-error detection, and (non-exhaustive) multi-
error detection. The physical overhead of these schemes is 6 or 7 additional
bus lines plus the encoders and decoders. When an error is detected but not
corrected, the data is retransmitted. When an error goes undetected, the system
suffers a catastrophic failure. For a given reliability specification of mean time
to failure (MTTF) - ranging from 10 years to a few milliseconds - it is
possible to determine the average energy per useful bit that is transmitted
under various hypotheses. Such hypotheses include wiring length, and thus
the ratio of energy spent on wires over the energy spent in coding, and
voltage swings. In particular, for a long MTTF and long wires (5 pF),
error detection with retransmission is more energy-efficient than forward
error correction, mainly for two reasons. First, for the same level of
redundancy, error detection is more robust than error correction; hence, the
signal-to-noise ratio can be lowered more aggressively. Second, the error-
detecting decoder is simpler and consumes less power than the error-
correcting decoder. These two advantages overcome retransmission costs,
which are sizable, but they are incurred under the relatively rare occurrence
of transmission errors.
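The 6-or-7-line overhead quoted above follows directly from standard Hamming-code arithmetic (this sketch reflects the general bound, not any implementation detail of [47]):

```python
def hamming_check_bits(data_bits, extended=False):
    """Smallest r with 2**r >= data_bits + r + 1 (single-error correction);
    extended=True adds one overall parity bit for double-error detection."""
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + (1 if extended else 0)

print(hamming_check_bits(32))                 # 6 extra bus lines (SEC)
print(hamming_check_bits(32, extended=True))  # 7 extra bus lines (SEC-DED)
```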
In case of shared-medium network links (such as busses), the media-
access-control function of the data link layer is also critical for energy
efficiency. Currently, centralized time-division multiplexing schemes (also
called centralized arbitration) are widely adopted [52][53][54]. In these
schemes, a single arbiter circuit decides which transmitter accesses the bus
for every time slot. Unfortunately, the poor scalability of centralized
arbitration indicates that this approach is likely to be energy-inefficient as
micro-network complexity scales up. In fact, the energy cost of
communicating with the arbiter and the hardware complexity of the arbiter
itself scale up more than linearly with the number of bus masters.
Distributed arbitration schemes as well as alternative multiplexing
approaches, such as code division multiplexing, have been extensively
adopted in shared-medium macroscopic networks and are actively being
investigated for on-chip communication [55]. However, research in this area
is just burgeoning, and significant work is needed to develop energy-aware
media-access-control for future micro-networks.

16.5.3.3 Network Layer

Network architecture heavily influences communication energy. As hinted in
the previous section, shared-medium networks (busses) are currently the
most common choice, but it is intuitively clear that busses are not energy-
efficient as network size scales up [56]. In bus-based communication, data is
always broadcast from one transmitter to all possible receivers, while in
most cases messages are destined to only one receiver or a small group of
receivers. Bus contention, with the related arbitration overhead, further
contributes to the energy overhead.
Preliminary studies on energy-efficient on-chip communication indicate
that hierarchical and heterogeneous architectures are much more energy-
efficient than busses [57][51]. In their work, Zhang et al. [51] develop a
hierarchical generalized mesh where network nodes with a high
communication bandwidth requirement are clustered and connected through
a programmable generalized mesh consisting of several short
communication channels joined by programmable switches. Clusters are
then connected through a generalized mesh of global long communication
channels. Clearly, such an architecture is heterogeneous because the energy cost
of intra-cluster communication is much smaller than that of inter-cluster
communication. While the work of Zhang et al. demonstrates that power can
be saved by optimizing network architecture, many network design issues
are still open, and tools and algorithms are needed to explore the design
space and to tailor network architecture to specific applications or classes of
applications.
Network architecture is only one facet of network layer design, the other
major facet being network control. A critical issue in this area is the choice
of a switching scheme for indirect network architectures. From the energy
viewpoint, the tradeoff is between the cost of setting up a circuit-switched
connection once and for all and the overhead of switching packets throughout
the entire communication time on a packet-based connection. In the former
case the network control overhead is “lumped” and incurred once, while in
the latter case, it is distributed over many small contributions, one for each
packet. When communication flow between network nodes is extremely
persistent and stationary, circuit-switched schemes are likely to be
preferable, while packet-switched schemes should be more energy-efficient
for irregular and non-stationary communication patterns. Needless to say,
circuit switching and packet switching are just two extremes of a spectrum,
with many hybrid solutions in between [58].
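A back-of-the-envelope model captures the trade-off between the two extremes; all energy costs below are arbitrary assumed units, chosen only to show the break-even behavior:

```python
def circuit_energy(n_packets, e_setup, e_forward):
    # connection setup is paid once, then forwarding is cheap per packet
    return e_setup + n_packets * e_forward

def packet_energy(n_packets, e_forward, e_header):
    # no setup, but every packet pays header/routing overhead
    return n_packets * (e_forward + e_header)

# assumed costs: setup 100, forwarding 10, per-packet header 2
# break-even at n* = e_setup / e_header = 50 packets
for n in (10, 50, 200):
    circuit = circuit_energy(n, 100, 10)
    packet = packet_energy(n, 10, 2)
    print(n, "circuit" if circuit < packet else "packet")
```

Short, irregular flows favor packet switching; long, persistent flows amortize the setup cost and favor circuit switching, in line with the discussion above.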

16.5.3.4 Transport Layer

Above the network layer, the communication abstraction is an end-to-end
connection. The transport layer is concerned with optimizing the use of
network resources and providing a requested quality of service. Clearly,
energy can be seen as a network resource or a component in a quality of
service metric. An example of a transport-layer design issue is the choice
between connection-oriented and connectionless protocols. Energy
efficiency can be heavily impacted by this decision. In fact, connection-
oriented protocols can be energy inefficient under heavy traffic conditions
because they tend to increase the number of retransmissions. On the other
hand, out-of-order data delivery may imply additional work at the receiver,
which causes additional energy consumption. Thus, communication energy
should be balanced against computation energy at destination nodes.
Another transport-layer task with far-reaching implications on energy is
flow control. When many transmitters compete for limited communication
resources, the network becomes congested, and the cost per transmitted bit
increases because of increased contention and contention resolution
overhead. Flow control can mitigate the effect of congestion by regulating
the amount of data that enters the network at the price of some throughput
penalty. Energy reduction by flow control has been extensively studied for
wireless networks [58][59], but it is an unexplored research area for on-chip
micro-networks.

16.6 SOFTWARE

Systems have several software layers running on top of the hardware. Both
system and application software programs are considered here.
Software does not consume energy per se, but it is the execution and
storage of software that requires energy consumption by the underlying
hardware. Software execution corresponds to performing operations on
hardware, as well as storing and transferring data. Thus software execution
involves power dissipation for computation, storage, and communication.
Moreover, storage of computer programs in semiconductor memories
requires energy (e.g., refresh of DRAMs, static power for SRAMs).

The energy budget for storing programs is typically small (with the
choice of appropriate components) and predictable at design time.
Software 499

Nevertheless, reducing the size of the stored programs is beneficial. This can
be achieved by compilation (see Section 6.2.2) and code compression. In the
latter case, the compiled instruction stream is compressed before storage. At
run time, the instruction stream is decompressed on the fly. Besides reducing
the storage requirements, instruction compression reduces the data traffic
between memory and processor and the corresponding energy cost. (See also
Section 4.5.) Several approaches have been devised to reduce instruction
fetch-and-store overhead, as surveyed in [11]. The following subsections
focus mainly on system-level design techniques to reduce the power
consumption associated with the execution of software.

16.6.1 System Software

The notion of operating system (OS) is generalized to capture the system
programs that provide support for the operation of SoCs. Note that the
system support software in current SoCs usually consists of ad hoc routines,
designed for a specific integrated core processor, under the assumption that a
processor provides global, centralized control for the system. In future SoCs,
the prevailing paradigm will be peer-to-peer interaction among several,
possibly heterogeneous, processing elements. Thus, system software will be
designed as a modular distributed system. Each programmable component
will be provided with system software to support its own operation, to
manage its communication with the communication infrastructure, and to
interact effectively with the system software of the neighboring components.
Seamless composition of components around the micro-network will
require the system software to be configurable according to the requirements
of the network. Configuration of system software may be achieved in
various ways, ranging from manual adaptation to automatic configuration.
At one end of the spectrum, software optimization and compactness are
privileged; at the other end, design ease and time are favored. With this
vision, on-chip communication protocols should be programmable at the
system software level, to adapt the underlying layers (e.g., transport) to the
characteristics of the components.
Let us now consider the broad objectives of system software. For most
SoCs, which are dedicated to some specific application, the goal of system
software is to provide the required quality of service within the physical
constraints. Consider, for example, an SoC for a wireless mobile video
terminal. Quality of service relates to the video quality, which implies
specific performance levels of the computation and storage elements as well
as of the micro-network. Constraints relate to the strength and S/N ratio of
the radio-frequency signal and to the energy available in the battery. Thus,
the major task of system software is to provide high performance by
orchestrating the information processing within the service stations and
providing the “best” information flow. Moreover, this task should be
achieved while keeping energy consumption to a minimum.
The system software provides us with an abstraction of the underlying
hardware platform. In a nutshell, one can view the system as a queuing
network of service stations. Each service station models a computational or
storage unit, while the queuing network abstracts the micro-network.
Moreover, one can assume that:

Each service station can operate at various service levels, providing
corresponding performance and energy consumption levels. This
abstracts the physical implementation of components with adjustable
voltage and/or frequency levels, as well as with the ability to disable
their functions in full or in part.
The information flow between the various units can be controlled by
the system software to provide the appropriate quality of service. This
entails controlling the routing of the information, the local buffering
into storage arrays, and the rate of the information flow.

In other words, the system software must support the dynamic power
management (DPM) of its components as well as dynamic information-flow
management.

16.6.1.1 Dynamic Power Management

Dynamic power management (DPM) is a feature of the run-time
environment of an electronic system that dynamically reconfigures it to
provide the requested services and performance levels with a minimum
number of active components or a minimum activity level on such
components [35]. DPM encompasses a set of techniques that achieve energy-
efficient computation by selectively turning off (or reducing the performance
of) system components when they are idle (or partially unexploited). DPM is
often realized by throttling the frequency of processor operation (and
possibly stopping the clock) and/or reducing the power supply voltage.
Dynamic frequency scaling (DFS) and dynamic voltage scaling (DVS) are
the terms commonly used to denote power management over a range of
values. Typically, DVS is used in conjunction with DFS since reduced
voltage operation requires lower operating frequencies, while the converse is
not true.
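The reason DVS pays off follows from the classic CMOS dynamic-power relation P = C_eff * Vdd^2 * f. A small sketch with normalized, assumed values (ignoring leakage and assuming frequency must track voltage):

```python
def power(c_eff, vdd, f):
    """Dynamic switching power: P = C_eff * Vdd^2 * f."""
    return c_eff * vdd ** 2 * f

def energy_per_op(c_eff, vdd):
    """Energy per operation depends on voltage only: E = C_eff * Vdd^2."""
    return c_eff * vdd ** 2

# halving Vdd (with the matching halved clock) cuts power 8x, and --
# more importantly for DVS -- energy per operation 4x
print(power(1.0, 1.0, 1.0) / power(1.0, 0.5, 0.5))        # 8.0
print(energy_per_op(1.0, 1.0) / energy_per_op(1.0, 0.5))  # 4.0
```

This is why DVS beats simply idling: slowing down and lowering voltage reduces the energy of the work itself, not just the time spent doing it.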
The fundamental premise for the applicability of DPM is that systems
(and their components) experience non-uniform workloads during operation
time. Such an assumption is valid for most systems, both when considered in
isolation and when inter-networked. A second assumption of DPM is that it
is possible to predict, with a certain degree of confidence, the fluctuations of
workload. Workload observation and prediction should not consume
significant energy.
Designing power-managed systems encompasses several tasks, including
the selection of power-manageable components with appropriate
characteristics, determining the power management policy [35], and
implementing the policy at an appropriate level of system software. DPM
was described in a previous chapter. This chapter considers only the
relations between DPM policy implementation and system software.
A power management policy is an algorithm that observes requests and
states of one or more components and issues commands related to frequency
and voltage settings. In particular, the power manager can turn on/off the clock
and/or the power supply to a component. Whereas policies can be
implemented in hardware (as a part of the control-unit of a component),
software implementations achieve much greater flexibility and ease of
integration. Thus a policy can be seen as a program that is executed at run-
time by the system software.
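As a concrete illustration, one of the simplest such programs is a fixed-timeout policy. The sketch below is a minimal model of the idea (wake-up latency and wake-up energy are deliberately not modeled, and the interface names are illustrative, not from any real OS):

```python
class TimeoutPolicy:
    """Fixed-timeout DPM policy: shut a component down once it has been
    idle longer than `timeout`; a new request wakes it up again."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_request = 0.0
        self.state = "on"

    def request(self, now):
        # a request for service always wakes the component
        self.last_request = now
        self.state = "on"

    def observe(self, now):
        # periodically invoked by the policy manager
        if self.state == "on" and now - self.last_request > self.timeout:
            self.state = "off"
        return self.state
```

For example, with a 5-time-unit timeout, a component last used at t=0 is still on at t=3 but has been shut down by t=6. Predictive and stochastic policies refine this basic scheme.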
The simplest implementation of a policy is by a filter driver, i.e., by a
program attached to the software driver of a specific component. The driver
monitors the traffic to/from the component and has access to the component
state. Nevertheless, the driver has a limited view of other components. Thus
such an implementation of power management may suffer from excessive
locality.
Power management policies can be implemented in system kernels and
be tightly coupled to process management. Indeed, process management has
knowledge of currently-executing tasks and tasks coming up for execution.
Process managers also know which components (devices) are needed by
each task. Thus, policy implementation at this level of system software
enjoys both a global view and an outlook of the system operation in the near
future. Predictive component wake-up is possible with the knowledge of
upcoming tasks and required components.
The system software can be designed to improve the effectiveness of
power management. Power management exploits idle times of components.
The system software scheduler can sequence tasks for execution with the
additional goal of clustering component operation, thus achieving fewer but
longer idle periods. Experiments with implementing DPM policies at
different levels of system software [60] have shown increasing energy
savings as the policies have deeper interaction with the system software
functions.

16.6.1.2 Information-flow Management

Dynamic information-flow management relates to configuring the
micro-network and its bandwidth to satisfy the information flow requirements. This
problem is tightly related to DPM and can be seen as an application of DPM
to the micro-network instead of to components. Again, policies implemented
at the system software layer request either specific protocols or parameters at
the lower layers to achieve the appropriate information flow, using the least
amount of resources and energy.
An example of information-flow management is provided by the Maia
processor [61], which combines an ARM8 processor core with 21 satellite
units, including processing and storage units. The ARM8 processor
configures the memory-mapped satellites using a 32-bit configuration bus,
and communicates data with satellites using two pairs of I/O interface ports
and direct memory read/writes. Connections between satellites are through a
2-level hierarchical mesh-structured reconfigurable network. Dynamic
voltage scaling is applied to the ARM8 core to increase energy efficiency.
With this approach, the micro-network can be configured before running
specific applications and tailored to these applications. Thus, application
programs can be spatially distributed and achieve an energy savings of one
order of magnitude as compared to a DSP processor with the same
performance level. Such savings are due to the ability of Maia to reconfigure
itself to best match the applications, to activate satellites only when data is
present, and to operate at dynamically varying rates.

16.6.2 Application Software

The energy cost of executing a program depends on its machine code and on
the corresponding micro-architecture, if one excludes the intervention of the
operating system in the execution (e.g., swapping). Thus, for any given
micro-architecture, the energy cost is tied to the machine code.
There are two important problems of interest: software design and
software compilation. Software design affects energy consumption because
the style of the software source program (for any given function) affects the
energy cost. For example, the probability of swapping depends on
appropriate array dimensioning while considering the hardware storage
resources. As a second example, the use of specific constructs, such as
guarded instructions instead of branching constructs for the ARM
architecture [6], may significantly reduce the energy cost. Several efforts
have addressed the problem of automatically re-writing software programs to
increase their efficiency. Other efforts have addressed the generation of
energy-efficient software from high-level specification. We call these
techniques software synthesis.
Eventually, since the machine code is derived from the source code from
compilation, it is the compilation process itself that affects the energy
consumption. It is important to note that most compilers were written for
achieving high-performing code with short compilation time. The design of
embedded systems running dedicated software has brought a renewed
interest in compilation, especially because of the desire of achieving high-
quality code (i.e., fast, energy efficient) possibly at the expense of longer
compilation time (which is tolerable for embedded systems running code
compiled by the manufacturer).
For both software synthesis and compilation it is important to define the
metrics of interest well. Typically, the performance (e.g., latency) and
energy of a given program can be evaluated in the worst or average case.
Worst-case latency analysis is relevant to real-time software design when
hard timing constraints are specified. In general, average latency and average
energy consumption are of interest. Average measures require the
knowledge of the environment, i.e., the distribution of program inputs, which
eventually affect the branches taken and the number of iterations. When such
information is unavailable, meaningful average measures are impossible to
achieve.
To avoid this problem, some authors have measured performance and
energy on basic blocks, thus avoiding the effects of branching and
iteration. It is often the case that instructions can be grouped into two
classes. Instructions with no memory access tend to have similar energy cost
and execute in a single cycle. Instructions with memory access have higher
latency and energy cost. With these assumptions, reducing code size and
reducing memory accesses (e.g., spills) achieves the fastest and most energy-
efficient code. Nevertheless, this argument breaks down when instructions
(with no memory access) have non-uniform energy costs, even though
experimental results do not show significant variation between compiling
for low latency and compiling for low energy.
It is very important to stress that system design requires the coordination
of various hardware and software components. Thus, evaluation of software
programs cannot be done in isolation. Profiling techniques can and must be
used to determine the frequency distribution of the values of the input to
software programs and subprograms. Such information is of paramount
importance for achieving application software that is energy efficient in the
specific environment where it will be executed. It is also interesting to note
that, given a specific environment profile, the software can be restructured so
that lower energy consumption can be achieved at the price of slightly higher
latency. In general, the quest for maximum performance pushes toward the
speculative execution and aggressive exploitation of all hardware resources
available in the system. In contrast, energy efficiency requires a more
conservative approach, which limits speculation and reduces the amount of
redundant work that can be tolerated for a marginal performance increase
[62].

16.6.2.1 Software Synthesis

Software synthesis is a term used with different connotations. In the present
context, software synthesis is an automated procedure that generates source
code that can be compiled. Whereas source code programs can be
synthesized from different starting points, source code synthesis from
programs written in the same programming language are considered here.
Software synthesis is often needed because the energy consumption of
executing a program depends on the style and constructs used. Optimizing
compilers are biased by the starting source code to be compiled. Recall that
programs are often written with only functionality and/or performance in
mind, and rarely with concerns for energy consumption. Moreover, it is
common practice to use legacy code for embedded applications, sometimes
with high-energy penalties. Nevertheless, it is conceivable to view this type
of software synthesis as pre-processing for compilation with specific goals.

Source-level transformations
Recently several researchers have proposed source-to-source transformations
to improve software code quality, and in particular energy consumption.
Some transformations are directed toward using storage arrays more
efficiently [13][63]. Others exploit the notion of value locality. Value
locality is defined as the likelihood of a previously-seen value recurring
repeatedly within a physical or logical storage location [64]. With value
locality information, reusing previous computations can reduce the
computational cost of a program.
Researchers have shown that value locality can be exploited in various
ways depending on the target system architecture. In [65], common-case
specialization was proposed for hardware synthesis using loop unrolling and
algebraic reduction techniques. In [66][64], value prediction was proposed to
reduce load/store operations by modifying a general-purpose
microprocessor. Some authors [67] considered redundant computation, i.e.,
performing the same computation for the same operand value. Redundant
computation can be avoided by reusing results from a result cache.
Unfortunately, some of these techniques are architecture dependent, and thus
cannot be used within a general-purpose software synthesis utility.
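As an architecture-independent illustration of the result-cache idea, the same effect can be obtained in software by memoization. The function body below is a hypothetical stand-in for an expensive, side-effect-free computation whose operands exhibit value locality:

```python
import functools

@functools.lru_cache(maxsize=256)
def filter_tap(x):
    # stand-in for a costly, side-effect-free computation
    return 3 * x * x + 2 * x + 1

# repeated operand values hit the cache instead of recomputing
results = [filter_tap(v) for v in (5, 7, 5, 5, 7)]
print(filter_tap.cache_info().hits)  # 3
```

The technique only saves energy when the cached computation is expensive enough to amortize the lookup cost, mirroring the caveat for hardware result caches below.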
Next, a family of techniques for source code optimization, based on
specialization of programs and data, is considered. Program specialization
encodes the results of previous computations in a residual program, while
data specialization encodes these results in data structures such as caches
[84]. Program specialization is more aggressive in the sense that it optimizes
even the control flow, but it can lead to a code explosion problem due to
over-specialization. For example, code explosion can occur when a loop is
unrolled and the number of iterations is large. Furthermore, code explosion
can degrade the performance of the specialized program due to increased
instruction cache misses.
On the other hand, data specialization is much less sensitive to code
explosion because the previous computation results are stored in a data
structure that requires less memory than the textual representation of
program specialization. However, this technique should be carefully applied
such that the cached previous computations are expensive enough to
amortize the cache access overhead. The cache can also be implemented in
hardware to amortize the cache access overhead [67].
A specific instance of program specialization was proposed by Chung et
al. [68]. In this approach, the computational effort of a source code program
is estimated with both value and execution-frequency profiling. The most
effective specializations are automatically searched and identified, and the
code is transformed through partial evaluation. Experimental results show
that this technique improves both the energy consumption and the performance
of the source code by up to more than a factor of two, and on average by about
35%, relative to the original program.

Example 2 Consider the source code in Figure 16.7 (a), and the first call of
procedure foo in procedure main. If the first parameter a were 0 for all
cases, this procedure could be reduced to procedure sp_foo by partial
evaluation, as shown in Figure 16.7 (b).
In reality, the value of parameter a is not always 0, and the call to
procedure foo cannot be substituted by procedure sp_foo. Instead, it can be
replaced by a branching statement that selects an appropriate procedure
call, depending on the result of the common value detection (CVD). The
CVD procedure is named cvd_foo in Figure 16.7 (b). This transformation
step is called source code alteration. Its effectiveness depends on the
frequency with which a takes the common value 0.
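Since Figure 16.7 is not reproduced here, the following sketch illustrates the same transformation pattern with a hypothetical body for foo (the real body is in the figure; only the structure of the transformation is the point):

```python
def foo(a, b):
    # stand-in for the original general-purpose procedure
    return a * b + a + b

def sp_foo(b):
    # residual program: foo partially evaluated for the common case a == 0
    return b

def cvd_foo(a, b):
    # common value detection: take the cheap specialized path when the
    # profiled common value occurs, otherwise fall back to the general code
    return sp_foo(b) if a == 0 else foo(a, b)
```

Energy is saved whenever the branch in cvd_foo takes the specialized path often enough to pay for the added test.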
Software libraries
Software engineers working on embedded systems often use software
libraries, such as those developed by standards groups (e.g., MPEG) or by
system companies (e.g., Intel's multimedia library for the SA-1110 and TI's
library for the TI'54x DSP). Embedded operating systems typically provide a
choice from a number of math and other libraries [69]. When a set of pre-
optimized libraries is available, the designer has to choose the elements that
perform best for a given section of the code. Such a manual optimization is
error-prone and should be replaced by automated library insertion techniques
that can be seen as part of software synthesis.
For example, consider a section of code that calls the log function. The
library may contain four different software implementations: double, float,
fixed point using a simple bit-manipulation algorithm [93][89], and fixed point
using polynomial expansion. Each implementation has a different accuracy,
performance, and energy trade-off.
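Two points on that trade-off curve can be sketched as follows. The fixed-point routine is a simplified stand-in for the cited bit-manipulation algorithms; it returns only the integer part of the base-2 logarithm of a Q16.16 value:

```python
import math

def log2_double(x):
    # full-precision library version: accurate but relatively costly
    return math.log2(x)

def log2_fixed_bitops(x_q16):
    """Coarse base-2 log of a Q16.16 fixed-point value using only bit
    manipulation: the integer part is the position of the leading one."""
    return x_q16.bit_length() - 1 - 16

x = 8.0
print(log2_double(x))                         # 3.0
print(log2_fixed_bitops(int(x * (1 << 16))))  # 3
```

The bit-manipulation version trades accuracy for a handful of integer operations, which is the kind of trade-off an automated library-selection tool must characterize.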
Thus, the automation of the use of software libraries entails two major
tasks. First, characterize the library element implementations in terms of the
criteria of interest. This can be achieved by analyzing the corresponding
instruction flow for a given architecture. Second, recognize the sections of
code that can be replaced effectively by library elements.
In the case of computation-intensive basic blocks of data-flows, code
manipulation techniques based on symbolic algebra have been shown to be
effective in both optimizing the computation by reshaping the data flow and
in performing the automatic mapping to library elements. Moreover, these
tasks can be fully automated. These methods are based on the premise that in
several application domains (e.g., multimedia) computation can be reduced
to the evaluation of polynomials with fixed-point precision. The loss in
accuracy is usually compensated by faster evaluation and lower energy
consumption. Next, polynomials can be algebraically manipulated using
symbolic techniques, similar to those used by tools such as Maple.
Polynomial representations of computation can also be decomposed into
sequences of operations to be performed by software library elements or
elementary instructions. Such a decomposition can be driven by energy
and/or performance minimization goals. Recent experiments have shown
large energy gains on applications such as MP3 decoding [70].
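As a small example of such a decomposition, Horner's rule rewrites a polynomial into a chain of multiply-add steps, roughly halving the multiplication count relative to naive term-by-term evaluation:

```python
def horner(coeffs, x):
    """Evaluate c0 + c1*x + ... + cn*x**n with n multiplies and n adds
    by nesting: ((cn*x + c(n-1))*x + ...)*x + c0."""
    acc = 0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc

print(horner([1, 2, 3], 2))  # 1 + 2*2 + 3*4 = 17
```

Each multiply-add step maps naturally onto a MAC instruction or a fixed-point library element, which is what makes this reshaping attractive for the decomposition described above.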

16.6.2.2 Software Compilation

Most software compilers consist of three layers: the front-end, the machine-
independent optimization, and the back-end. The front-end is responsible for
parsing and performing syntax and semantic analysis, as well as for
generating an intermediate form, which is the object of many machine-
independent optimizations [71]. The back-end is specific to the hardware
architecture, and it is often called code generator or codegen. Typically,
energy-efficient compilation is performed by introducing specific
transformations in the back-end, because they are directly related to the
underlying architecture. Nevertheless, some machine-independent
optimizations can be useful in general to reduce energy consumption [72].
An example is selective loop unrolling, which reduces the loop overhead but
is effective only if the loop body is short enough. Another example is software
pipelining, which decreases the number of stalls by fetching instructions
from different iterations. A third example is removing tail recursion, which
eliminates the stack overhead.
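As an illustration of the last transformation, tail-recursion removal turns the recursive call into a jump back to the loop head, eliminating the per-call stack traffic; the two routines below (an illustrative sketch) compute the same sum:

```python
def sum_recursive(values, acc=0):
    """Tail-recursive sum: every element costs a call, a return, and a
    stack frame (call overhead and stack traffic on a real processor)."""
    if not values:
        return acc
    return sum_recursive(values[1:], acc + values[0])

def sum_iterative(values):
    """After tail-recursion removal: the recursive call becomes a jump
    back to the loop head, so the stack stays flat."""
    acc = 0
    for v in values:
        acc += v
    return acc
```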
The main tasks of a code generator are instruction selection, register
allocation, and scheduling. Instruction selection is the task of choosing
instructions, each performing a fragment of the computation. Register
allocation is the task of allocating data to registers; when all registers are in
use, data is spilled to the main memory. Spills are usually undesirable
because of the performance and energy overhead of saving temporary
information in the main memory. Instruction scheduling is the task of
ordering instructions in a linear sequence. When considering compilation for general-
purpose microprocessors, instruction selection and register allocation are
often achieved by dynamic programming algorithms [71], which also
generate the order of the instructions. When considering compilers for
application-specific architectures (e.g., DSPs), the compiler back-end is
often more complex, because of irregular structures such as inhomogeneous
register sets and connections. As a result, instruction selection, register
allocation, and scheduling are intertwined problems that are much harder to
solve [73].
Energy-efficient compilation exploiting instruction selection was
proposed by Tiwari et al. [74] and tied to software analysis and the
determination of base costs for operations. Tiwari proposed an
instruction selection algorithm based on the classical dynamic-programming
tree covering [71], in which the instruction weights are the energy costs. Experimental results
showed that this algorithm yields results similar to the traditional algorithm
because energy weights do not differ much in practice.
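A toy version of such an energy-weighted tree cover is sketched below; the instruction set (ADD, MUL, and a fused MAC) and the energy numbers are hypothetical, chosen only to show how the dynamic program weighs a fused pattern against separate instructions:

```python
class Node:
    """An expression-tree node: op is '+', '*', or 'leaf'."""
    def __init__(self, op, left=None, right=None):
        self.op, self.left, self.right = op, left, right

# Hypothetical per-instruction energy costs: a fused multiply-accumulate
# is assumed cheaper than a separate MUL followed by an ADD.
E_ADD, E_MUL, E_MAC = 2, 4, 5

def min_energy(node):
    """Bottom-up dynamic programming over the tree: the best cost at a
    node is the cheapest instruction pattern rooted there plus the best
    costs of the pattern's inputs."""
    if node.op == 'leaf':
        return 0
    if node.op == '*':
        return E_MUL + min_energy(node.left) + min_energy(node.right)
    # node.op == '+'
    best = E_ADD + min_energy(node.left) + min_energy(node.right)
    if node.left.op == '*':   # pattern: MAC covers '+' plus its '*' child
        best = min(best, E_MAC + min_energy(node.left.left)
                         + min_energy(node.left.right)
                         + min_energy(node.right))
    return best
```

For a*b + c the cover chooses the MAC (cost 5) over MUL followed by ADD (cost 6); with different weights the same dynamic program would choose the separate instructions.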
Instruction scheduling is an enumeration of the instructions consistent
with the partial order induced by data and control flow dependencies.
Instruction re-ordering for low-energy can be done by exploiting the degrees
of freedom allowed by the partial order. Instruction re-ordering may have
several beneficial effects, including reduction of inter-instruction effects
[75] [76] as well as switching on the instruction bus [77] and/or in some
hardware circuits, such as the instruction decoder.
Su et al. [77] proposed a technique called cold scheduling, which aims at
ordering the instructions to reduce the inter-instruction effects. In their
model, the inter-instruction effects were dominated by the switching on the
internal instruction bus of a processor and by the corresponding power
dissipation in the processor's control circuit. Given op-codes for the
instructions, each pair of consecutive instructions requires as many bit lines
to switch as the Hamming distance between the respective op-codes. The
cold scheduling algorithm belongs to the family of list schedulers [78]. At
each step of the algorithm, all instructions that can be scheduled next are
placed on a ready list. The priority for scheduling an instruction is inversely
proportional to the Hamming distance from the currently scheduled
instruction, thus locally minimizing the inter-instruction energy consumption
on the instruction bus. Su [77] reported a reduction in overall bit switching
in the range of 20 to 30%.
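The core of the method can be sketched as a greedy list scheduler (an illustrative model, with op-codes and dependencies made up for the example):

```python
def hamming(a, b):
    """Number of bit lines that toggle between two op-codes."""
    return bin(a ^ b).count('1')

def cold_schedule(opcodes, deps):
    """Greedy list scheduling in the spirit of cold scheduling [77]:
    among the ready instructions (all predecessors already issued),
    issue the one whose op-code is closest, in Hamming distance, to the
    op-code issued last. deps maps instruction -> set of predecessors."""
    n, done, order, last = len(opcodes), set(), [], None
    while len(order) < n:
        ready = [i for i in range(n)
                 if i not in done and deps.get(i, set()) <= done]
        pick = ready[0] if last is None else \
            min(ready, key=lambda i: hamming(opcodes[last], opcodes[i]))
        done.add(pick)
        order.append(pick)
        last = pick
    return order

def bus_switching(opcodes, order):
    """Total Hamming distance along the issue order."""
    return sum(hamming(opcodes[a], opcodes[b])
               for a, b in zip(order, order[1:]))
```

On four independent instructions with op-codes 0000, 1111, 0001, 1110, the cold schedule 0, 0001, 1111, 1110 toggles 5 bit lines where program order toggles 11.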
Register assignment aims at utilizing the available registers most
effectively by reducing spills to main memory. Moreover, a register can be
labeled during the compilation phase, and register assignment can be
performed with the objective of reducing the switching in the instruction
register as well as in the register decoders [72]. Again, the idea is to reduce
the Hamming distance between pairs of consecutive register accesses. When
comparing this approach to cold scheduling, note that now the instruction
order is fixed, but the register labels can be changed. Mehta et al. [72]
proposed an algorithm that improves upon an initial register labeling by
greedily swapping labels until no further switching reduction is possible.
Experimental results showed an improvement ranging from 4.2% to 9.8%.
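A sketch of this greedy relabeling loop, with a made-up access trace and 2-bit labels, is shown below:

```python
def hamming(a, b):
    return bin(a ^ b).count('1')

def trace_cost(trace, enc):
    """Bit switching in the register-address field for a fixed
    instruction order, given an encoding enc: register -> binary label."""
    return sum(hamming(enc[a], enc[b]) for a, b in zip(trace, trace[1:]))

def relabel(trace, enc):
    """Greedy improvement in the spirit of [72]: keep swapping the
    labels of two registers while that lowers the switching cost; stop
    at a local minimum. Returns a new encoding."""
    enc = dict(enc)
    regs = list(enc)
    improved = True
    while improved:
        improved = False
        for i in range(len(regs)):
            for j in range(i + 1, len(regs)):
                a, b = regs[i], regs[j]
                before = trace_cost(trace, enc)
                enc[a], enc[b] = enc[b], enc[a]
                if trace_cost(trace, enc) < before:
                    improved = True
                else:
                    enc[a], enc[b] = enc[b], enc[a]   # undo the swap
    return enc
```

For the alternating trace r0, r1, r0, r1, ... with labels 00 and 11, swapping in an adjacent code brings the per-access toggle count from two bits down to one.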
Registers are only the last level of a memory hierarchy, which usually
contains caches, buffers, multi-banked memories, etc. Compilers can have a
large impact on energy consumption by optimizing not only register accesses
but all kinds of memory traffic patterns as well. Many compiler
transformations have limited scope, and they are not very effective in
reducing memory power outside the register file. However, some restricted
classes of programming constructs (namely, loop nets with data-independent
iterations) can be transformed and optimized by the compiler in a very
aggressive fashion. The theory and practice of loop transformations was
intensely explored by parallelizing and high-performance compilers in the
past [79], and it is being revisited from a memory energy minimization
viewpoint with promising results [80][81][63]. These techniques are likely to
have greater impact on SoCs because they have very heterogeneous memory
architectures, and they often expose memory transfers to the programmer, as
outlined in the case studies (this is rarely done in general-purpose
processors).
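A toy direct-mapped cache model makes the point: for a row-major array, traversing by rows hits within each fetched line, while traversing by columns (the case a loop-interchange transformation would fix) misses on every access. The parameters below are illustrative:

```python
def cache_misses(addresses, num_lines=8, words_per_line=4):
    """Direct-mapped cache model: count misses (each miss implies an
    energy-hungry transfer from the next memory level)."""
    tags = [None] * num_lines
    misses = 0
    for addr in addresses:
        line = addr // words_per_line
        idx = line % num_lines
        if tags[idx] != line:
            tags[idx] = line
            misses += 1
    return misses

# A row-major 16x16 array at address 0, visited in two loop orders.
ROWS = COLS = 16
row_wise = [r * COLS + c for r in range(ROWS) for c in range(COLS)]
col_wise = [r * COLS + c for c in range(COLS) for r in range(ROWS)]
```

With these toy parameters the row-wise order misses once per 4-word line (64 misses out of 256 accesses), while the column-wise order conflicts on every access (256 misses), a fourfold difference in off-cache traffic.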

16.6.2.3 Application Software and Power Management

The quest for very low energy software cost leads to the crafting and tuning
of very specific application programs. Thus, a reasonable question is: why
not let the application programs finely control the service levels and energy
cost of the underlying hardware components? There are typically two
objections to such an approach. First, application software should be
independent of the hardware platform for portability reasons. Second, system
software typically supports multiple tasks. When a task controls the
hardware, unfair resource utilization and deadlocks may become serious
problems.
For these reasons, it has been suggested [82] that application programs
contain system calls that request the system software to control a hardware
component, e.g., by turning it on or shutting it down, or by requesting a
specific frequency and/or voltage setting. The request can be accepted or
denied by the operating system, which has access to the task schedule
information and to the operating levels of the components. The advantage of
this approach is that OS-based power management is enhanced by receiving
detailed service request information from applications and thus is in a
position to make better decisions.
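The arbitration policy can be sketched in a few lines (a toy model of the OS side; the interface names are invented for the example):

```python
class PowerManager:
    """Toy OS-side arbiter: applications *request* device shutdowns via
    a system call; the OS grants the request only if no other task still
    uses the device, avoiding unfair resource utilization and deadlocks."""
    def __init__(self):
        self.users = {}                       # device -> set of task ids

    def open(self, task, device):
        self.users.setdefault(device, set()).add(task)

    def close(self, task, device):
        self.users.get(device, set()).discard(task)

    def request_shutdown(self, task, device):
        """True (granted) only if the caller is the sole remaining user."""
        return self.users.get(device, set()) <= {task}
```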
Another approach is to let the compiler extract the power management
requests directly from the application programs at compile time. This is
performed by an analysis of the code. Compiler-directed power management
has been investigated for variable-voltage, variable-speed systems. A
compiler can analyze the control-data flow graph of a program to find paths
where execution time is much shorter than the worst-case. It can then insert
voltage downscaling directives at the entry points of such paths, thereby
slowing down the processor (and saving energy) only when there is
sufficient slack [83].
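The decision inserted at such an entry point can be modeled as follows (an illustrative sketch: the frequency and voltage levels are hypothetical, and dynamic energy is approximated as C times Vdd squared times the cycle count):

```python
def pick_level(path_cycles, worst_cycles, f_max_hz, levels):
    """Choose the slowest available (frequency, Vdd) level that still
    meets the deadline implied by the worst-case path at full speed.
    levels is a list of (f_hz, vdd) pairs."""
    deadline = worst_cycles / f_max_hz
    for f_hz, vdd in sorted(levels):          # slowest candidates first
        if path_cycles / f_hz <= deadline:
            return f_hz, vdd
    return max(levels)                        # no slack: full speed

def switching_energy(cycles, vdd, c_eff=1.0):
    """Dynamic energy model: E = C_eff * Vdd^2 * (switched cycles)."""
    return c_eff * vdd * vdd * cycles
```

A path needing a quarter of the worst-case cycles can run at a quarter of the clock and, at the correspondingly reduced supply, spend roughly a quarter of the energy per cycle.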

16.7 SUMMARY

This concluding chapter has surveyed some of the challenges in achieving
energy-efficient system-level design, with specific emphasis on SoC
implementation.

Digital systems with very low energy consumption require the use of
components that exploit all features of the underlying technologies (as
described in the previous chapters) and the realization of an effective
interconnection of such components. Network technologies will play a major
role in the design of future SoCs, as the communication among components
will be realized as a network on chip. Micro-network architectural choices
and control protocol design will be key in achieving high performance and
low-energy consumption.
A large, maybe dominant, effort in SoC design is spent in writing
software, because the operation of programmable components can be
tailored to specific needs by means of embedded software. System software
must be designed to orchestrate the concurrent operation of on-chip
components and network. Dynamic power management and information-
flow management are implemented at the system software level, thus adding
to the complexity of its design. Eventually, application software design,
synthesis, and compilation will be crucial tasks in realizing low-energy
implementations.
Because of the key challenges presented in this book, SoC design
technologies will remain a central engineering problem, deserving large
human and financial resources for research and development.

REFERENCES
[1] K. Mai, T. Paaske, N. Jayasena, R. Ho, W. Dally, M. Horowitz, “Smart Memories: a
modular reconfigurable architecture,” IEEE International Symposium on Computer
Architecture, pp. 161-171, June 2000.
[2] D. Patterson, et al., “A Case for intelligent RAM,” IEEE Micro, vol. 17, no. 2, pp. 34-44,
March-April 1997.
[3] Shubat, “Moving the market to embedded memory,” IEEE Design & Test of Computers,
vol. 18, no. 3, pp. 16-27, May-June 2001.
[4] M. Suzuoki et al., “A Microprocessor with a 128-bit CPU, Ten Floating-Point MACs,
Four Floating-Point Dividers, and an MPEG-2 Decoder,” IEEE Journal of Solid-State
Circuits, vol. 34, no. 11, pp. 1608-1618, Nov. 1999.
[5] Kunimatsu et al., “Vector Unit Architecture for Emotion Synthesis,” IEEE Micro, vol.
20, no. 2, pp. 40-47, March-April 2000.
[6] L. Benini, G. De Micheli, “System-Level Power Optimization: Techniques and Tools,”
ACM Transactions on Design Automation of Electronic Systems, vol. 5, no. 2, pp. 115-
192, April 2000.
[7] M. Takahashi et al., “A 60-MHz 240-mW MPEG-4 Videophone LSI with 16-Mb
embedded DRAM,” IEEE Journal of Solid-State Circuits, vol. 35, no. 11, pp. 1713-
1721, Nov. 2000.

[8] H. V. Tran et al., “A 2.5-V, 256-level nonvolatile analog storage device using EEPROM
technology,” IEEE International Solid-State Circuits Conference, pp. 270-271, Feb.
1996.
[9] G. Jackson et al., “An Analog Record, playback and processing system on a chip for
mobile communications devices,” IEEE Custom Integrated Circuits Conference, pp. 99-
102, San Diego, CA, May 1999.
[10] M. Borgatti et al., “A 64-min Single-Chip Voice Recorder/Player Using Embedded 4-
b/cell FLASH Memory,” IEEE Journal of Solid-State Circuits, vol. 36, no. 3, pp. 516-
521, March 2001.
[11] Macii, L. Benini, M. Poncino, Memory Design Techniques for Low Energy Embedded
Systems, Kluwer, 2002.
[12] Gartner, Inc., Final 2000 Worldwide Semiconductor Market Share, 2000.
[13] F. Catthoor, S. Wuytack, E. De Greef, F. Balasa, L. Nachtergaele, and A. Vandecappelle,
Custom Memory Management Methodology: Exploration of Memory Organization for
Embedded Multimedia System Design, Kluwer, 1998
[14] D. Lidsky, J. Rabaey, “Low-power design of memory intensive functions,” IEEE
Symposium on Low Power Electronics, San Diego, CA, pp. 16-17, September 1994.
[15] P. R. Panda, F. Catthoor, N. D. Dutt, K. Danckaert, E. Brockmeyer, C. Kulkarni, A.
Vandecappelle, P. G. Kjeldsberg, “Data and memory optimization techniques for
embedded systems,” ACM Transactions on Design Automation of Electronic Systems,
vol. 6, no. 2, pp. 149-206, April 2001.
[16] W. Shiue, C. Chakrabarti, “Memory exploration for low power, embedded systems,”
DAC-36: ACM/IEEE Design Automation Conference, pp. 140-145, June 1999.
[17] L. Su, A. Despain, “Cache design trade-offs for power and performance optimization: A
case study,” ACM/IEEE International Symposium on Low Power Design, pp. 63-68,
April 1995.
[18] M. Kamble, K. Ghose, “Analytical energy dissipation models for low-power caches,”
ACM/IEEE International Symposium on Low Power Electronics and Design, pp. 143-
148, August 1997.
[19] U. Ko, P. Balsara, A. Nanda, “Energy optimization of multilevel cache architectures for
RISC and CISC processors,” IEEE Transactions on VLSI Systems, vol. 6, no. 2, pp. 299-
308, June 1998.
[20] R. Bahar, G. Albera, S. Manne, “Power and performance tradeoffs using various caching
strategies,” ACM/IEEE International Symposium on Low Power Electronics and Design,
pp. 64-69, Aug. 1998.
[21] V. Zyuban, P. Kogge, “The energy complexity of register files,” ACM/IEEE
International Symposium on Low Power Electronics and Design, pp. 305-310, Aug.
1998.
[22] S. Coumeri, D. Thomas, “Memory modeling for system synthesis,” ACM/IEEE
International Symposium on Low Power Electronics and Design, pp. 179-184, Aug.
1998.
[23] T. Juan, T. Lang, J. Navarro, “Reducing TLB power requirements,” ACM/IEEE
International Symposium on Low Power Electronics and Design, pp. 196-201, August
1997.
[24] Farrahi, G. Tellez, M. Sarrafzadeh, “Memory segmentation to exploit sleep mode
operation,” ACM/IEEE Design Automation Conference, pp. 36-41, June 1995.
[25] Gonzàlez, C. Aliagas, M. Valero, “A Data-cache with multiple caching strategies tuned
to different types of locality,” ACM International Conference on Supercomputing, pp.
338-347, July 1995.
[26] V. Milutinovic, B. Markovic, M. Tomasevic, M. Tremblay, “A new cache architecture
concept: The Split Temporal/Spatial Cache,” IEEE Mediterranean Electrotechnical
Conference, pp. 1108-1111, March 1996.
[27] P. Grun, N. Dutt, A. Nicolau, “Access pattern based local memory customization for
low-power embedded systems,” Design Automation and Test in Europe, pp. 778-784,
March 2001.
[28] R. Bajwa, M. Hiraki, H. Kojima, D. Gorny, K. Nitta, A. Shridhar, K. Seki, K. Sasaki,
“Instruction buffering to reduce power in processors for signal processing,” IEEE
Transactions on VLSI Systems, vol. 5, no. 4, pp. 417-424, Dec. 1998.
[29] J. Kin, M. Gupta, W. Mangione-Smith, “The filter cache: an energy efficient memory
structure,” IEEE/ACM International Symposium on Microarchitecture, pp. 184-193,
Dec. 1997.
[30] P. Panda, N. Dutt, Memory Issues in Embedded Systems-on-Chip: Optimization and
Exploration, Kluwer, 1999.
[31] P. Panda, N. Dutt, A. Nicolau, “On-chip vs. off-chip memory: the data partitioning
problem in embedded processor-based systems,” ACM Transactions on Design
Automation of Electronic Systems, vol. 5, no. 3, pp. 682-704, July 2001.
[32] J. Hennessy, D. Patterson, Computer Architecture - A Quantitative Approach, II Edition,
Morgan Kaufmann Publishers, 1996.
[33] D. C. Burger, Hardware Techniques to Improve the Performance of the
Processor/Memory Interface, Ph.D. Dissertation, University of Wisconsin-Madison,
1998.
[34] D. Burger, J. Goodman, A. Kagle, “Limited bandwidth to affect processor design,” IEEE
Micro, vol. 17, no. 6, November/December 1997.
[35] L. Benini, A. Bogliolo, G. De Micheli, “A survey of design techniques for system-level
dynamic power management,” IEEE Transactions on Very Large-Scale Integration
Systems, vol. 8, no. 3, pp. 299-316, June 2000.
[36] H. Lekatsas, W. Wolf, “Code compression for low power embedded systems,”
ACM/IEEE Design Automation Conference, pp. 294-299, June 2000.
[37] S. Liao, S. Devadas, K. Keutzer, “Code density optimization for embedded DSP
processors using data compression techniques,” IEEE Transactions on CAD/ICAS, vol.
17, no. 7, pp. 601-608, July 1998.
[38] Y. Yoshida, B. Song, H. Okuhata, T. Onoye, I. Shirakawa, “An object code compression
approach to embedded processors,” ACM/IEEE International Symposium on Low Power
Electronics and Design, pp. 265-268, August 1997.
[39] L. Benini, D. Bruni, A. Macii, E. Macii, “Hardware-assisted data compression for energy
minimization in systems with embedded processors,” IEEE Design and Test in Europe,
pp. 449-453, March 2002.
[40] C. Lefurgy, Efficient Execution of Compressed Programs, Doctoral Dissertation, Dept.
of CS and Eng., University of Michigan, 2000.
[41] D. Sylvester and K. Keutzer, “A global wiring paradigm for deep submicron design,”
IEEE Transactions on CAD/ICAS, vol. 19, no. 2, pp. 242-252, February 2000.
[42] L. Benini and G. De Micheli, “Networks on chip: a new SoC paradigm,” IEEE
Computers, January 2002, pp. 70-78.
[43] H. Bakoglu, Circuits, Interconnections, and Packaging for VLSI, Addison-Wesley, 1990
[44] W. Dally and J. Poulton, Digital Systems Engineering, Cambridge University Press,
1998.
[45] R. Hegde, N. Shanbhag, “Toward achieving energy efficiency in presence of deep
submicron noise,” IEEE Transactions on VLSI Systems, vol. 8, no. 4, pp. 379-391,
August 2000.
[46] J. Duato, S. Yalamanchili, L. Ni, Interconnection Networks: an Engineering Approach.
IEEE Computer Society Press, 1997.
[47] D. Bertozzi, L. Benini and G. De Micheli, “Low-power error-resilient encoding for on-
chip data busses,” IEEE Design and Test in Europe, pp. 102-109, March 2002.
[48] B. Ackland et al., “A Single chip, 1.6-Billion, 16-b MAC/s multiprocessor DSP,” IEEE
Journal of Solid-State Circuits, vol. 35, no. 3, March 2000.
[49] H. Zhang, V. George, J. Rabaey, “Low-swing on-chip signaling techniques:
effectiveness and robustness,” IEEE Transactions on VLSI Systems, vol. 8, no. 3, pp.
264-272, June 2000.
[50] W. Bainbridge, S. Furber, “Delay insensitive system-on-chip interconnect using 1-of-4
data encoding,” IEEE International Symposium on Asynchronous Circuits and Systems,
pp. 118-126, 2001.
[51] H. Zhang, M. Wan, V. George, J. Rabaey, “Interconnect architecture exploration for
low-energy configurable single-chip DSPs,” IEEE Computer Society Workshop on VLSI,
pp. 2-8, 1999.
[52] P. Aldworth, “System-on-a-chip bus architecture for embedded applications,” IEEE
International Conference on Computer Design, pp. 297-298, Nov. 1999.
[53] B. Cordan, “An efficient bus architecture for system-on-chip design,” IEEE Custom
Integrated Circuits Conference, pp. 623-626, 1999.
[54] S. Winegarden, “A bus architecture centric configurable processor system,” IEEE
Custom Integrated Circuits Conference, pp. 627-630, 1999.
[55] R. Yoshimura, T. Koat, S. Hatanaka, T. Matsuoka, K. Taniguchi, “DS-CDMA wired bus
with simple interconnection topology for parallel processing system LSIs,” IEEE Solid-
State Circuits Conference, pp. 371-371, Jan. 2000.
[56] P. Guerrier, A. Greiner, “A generic architecture for on-chip packet-switched
interconnections,” Design Automation and Test in Europe Conference, pp. 250-256,
2000.
[57] C. Patel, S. Chai, S. Yalamanchili, D. Shimmel, “Power constrained design of
multiprocessor interconnection networks,” IEEE International Conference on Computer
Design, pp. 408-416, 1997.
[58] J. Walrand, P. Varaiya, High-Performance Communication Networks. Morgan Kaufmann,
2000.
[59] Papadimitriou, M. Paterakis, “Energy-conserving access protocols for transmitting data
in unicast and broadcast mode,” International Symposium on Personal, Indoor and
Mobile Radio Communication, pp. 416-420, 2000.
[60] Y. Lu, L. Benini and G. De Micheli, “Power Aware Operating Systems for Interacting
Systems,” IEEE Transactions on VLSI, April 2002.
[61] H. Zhang, V. Prabhu, V. George, M. Wan, M. Benes, A. Abnous, J. Rabaey, “A 1-V
Heterogeneous Reconfigurable DSP IC for Wireless Baseband Digital Signal
Processing,” IEEE Journal of Solid-State Circuits, vol. 35, no. 11, pp. 1697-1704, Nov.
2000.
[62] S. Manne, A. Klauser, D. Grunwald, “Pipeline gating: speculation control for energy
reduction,” International Symposium on Computer Architecture, pp. 122-131, July 1998.
[63] R. Panda et al., “Data memory organization and optimization in application-specific
systems,” IEEE Design & Test of Computers, vol. 18, no. 3, pp. 56-68, May-June 2001.
[64] M. Lipasti, C. Wilkerson, and J. Shen, “Value locality and load value prediction,”
ASPLOS, pp. 138-147, 1996.
[65] G. Lakshminarayana, A. Raghunathan, K. Khouri, K. Jha, and S. Dey, “Common-case
computation: a high-level technique for power and performance optimization,” Design
Automation Conference, pp. 56-61, 1999.
[66] K. Lepak and M. Lipasti, “On the value locality of store instructions,” ISCA, pp. 182-
191, 2000.
[67] S. E. Richardson, “Caching function results: faster arithmetic by avoiding unnecessary
computation,” Tech. report, Sun Microsystems Laboratories, 1992.
[68] E. Y. Chung, L. Benini and G. De Micheli, “Automatic source code specialization for
energy reduction,” ISLPED, IEEE Symposium on Low Power Electronics and Design,
2000, pp. 80-83.
[69] J. Crenshaw, Math Toolkit for Real-Time Programming, CMP Books, Kansas, 2000.
[70] Peymandoust and G. De Micheli, “Complex library mapping for embedded
software using symbolic algebra,” DAC, Design Automation Conference, 2002.
[71] Aho, R. Sethi, J. Ullman, Compilers: Principles, Techniques, and Tools, Addison-
Wesley, 1988.
[72] H. Mehta, R. Owens, M. Irwin, R. Chen, D. Ghosh, “Techniques for low energy
software,” International Symposium on Low Power Electronics and Design, pp. 72-75,
Aug. 1997.
[73] G. Goossens, P. Paulin, J. Van Praet, D. Lanneer, W. Geurts, A. Kifli and C. Liem,
“Embedded software in real-time signal processing systems: design technologies,”
Proceedings of the IEEE, vol. 85, no. 3, pp. 436-454, March 1997.
[74] V. Tiwari, S. Malik, A. Wolfe, “Power analysis of embedded software: a first step
towards software power minimization,” IEEE Transactions on VLSI Systems, vol. 2,
no.4, pp.437--445, Dec. 1994.
[75] M. Lorenz, R. Leupers, P. Marwedel, T. Drager, G. Fettweis, “Low-energy DSP code
generation using a genetic algorithm,” IEEE International Conference on Computer
Design, pp. 431-437, Sept 2001.
[76] V. Tiwari, S. Malik, A. Wolfe, M. Lee, “Instruction level power analysis and
optimization of software,” Journal of VLSI Signal Processing, vol. 13, no. 1-2, pp. 223-
233, 1996.
[77] Su, C. Tsui, A. Despain, “Saving power in the control path of embedded processors,”
IEEE Design and Test of Computers, vol. 11, no. 4, pp. 24-30, Winter 1994.
[78] G. De Micheli, Synthesis and Optimization of Digital Circuits, McGraw-Hill, 1994.
[79] M. Wolfe, High Performance Compilers for Parallel Computing, Addison-Wesley,
1996.
[80] M. Kandemir, M. Vijaykrishnan, M. Irwin, W. Ye, “Influence of compiler optimizations
on system power,” IEEE Transactions on VLSI Systems, vol. 9, no. 6, pp. 801-804, Dec.
2001.
[81] H. Kim, M. Irwin, N. Vijaykrishnan, M. Kandemir, “Effect of compiler optimizations on
memory energy,” IEEE Workshop on Signal Processing Systems, pp. 663-672, 2000.
[82] Y. Lu, L. Benini and G. De Micheli, “Requester-Aware Power Reduction,” ISSS,
International System Synthesis Symposium, 2000, pp. 18-23.
[83] D. Shin, J. Kim, “A profile-based energy-efficient intra-task voltage scheduling
algorithm for hard real-time applications,” IEEE International Symposium on Low-
Power Electronics and Design, pp. 271-274, Aug. 2001.
[84] S. Chirokoff and C. Consel, “Combining program and data specialization,” ACM
SIGPLAN Workshop on Partial Evaluation and Semantics-Based Program Manipulation
(PEPM '99), pp. 45-59, San Antonio, Texas, USA, January 1999.
[85] D. Ditzel, “Transmeta's Crusoe: Cool chips for mobile computing,” Hot Chips
Symposium.
[86] R. Ho, K. Mai, M. Horowitz, “The future of wires,” Proceedings of the IEEE, January
2001.
[87] K. Lahiri, A. Raghunathan, G. Lakshminarayana, S. Dey, “Communication architecture
tuners: a methodology for the design of high-performance communication architectures
for systems-on-chip,” IEEE/ACM Design Automation Conference, pp. 513-518, 2000.
[88] H. Mehta, R. M. Owens, M. J. Irwin, “Some issues in gray code addressing,” Great
Lakes Symposium on VLSI, pp. 178-180, March 1996.
[89] Red Hat, Linux-ARM Math Library Reference Manual.
[90] T. Theis, “The future of Interconnection Technology,” IBM Journal of Research and
Development, vol. 44, no. 3, May 2000, pp. 379-390.
[91] Wolfe, “Issues for low-power CAD tools: a system-level design study,” Design
Automation for Embedded Systems, vol. 1, no. 4, pp. 315-332, 1996.
[92] International Technology Roadmap for Semiconductors, http://public.itrs.net/.
[93] Cygnus Solutions, eCos Reference Manual, 1999.
[94] D. Bertsekas, R. Gallager, Data Networks. Prentice Hall, 1991.
[95] J. Montanaro et al., “A 160-MHz, 32-b, 0.5-W CMOS RISC microprocessor,” IEEE
Journal of Solid-State Circuits, vol. 31, no. 11, pp. 1703-1714, Nov. 1996.
Index
adaptive forward error correction, 335 beamforming, 362, 365
adaptive power-supply regulation, 202, behavior-level, 441, 442
215, 218, 228, 232, 237 bit-line capacitance, 63, 75, 77
ADC, 121, 125, 128, 133, 138, 145, 148 bit-width analysis, 181, 187, 193, 198
adjustable radio modulation, 335 body bias, 13, 24, 35, 44, 48, 95, 401,
ADPCM, 190, 192, 193, 194, 483 405, 406, 411
algorithm, 118 body effect, 38, 65, 402, 407
A/D conversion, 126 body factor, 130
beamforming, 362 Boltzmann distribution, 16, 26
block-formation, 289 bus, 33, 52, 76, 186, 238, 444, 461, 479,
data processing, 339 481, 493, 495, 502, 509, 514
dynamic programming, 508 characteristic distance, 361
FIR filtering, 344 charge pump circuits, 54, 56, 65, 67
greedy, 381 charge sharing, 110, 132
instruction selection, 508 chip multiprocessor, 480
leakage current minimization, 407 clock buffer, 151, 153, 157, 171, 174, 177
local routing, 260 clock data recovery, 201
network control, 494 clock gating, 110, 151, 155, 159, 167,
non-adaptive, 377 172, 373, 388, 392, 396, 433, 440,
power-reduction, 304 442, 481
routing, 494 clock network, 153, 156, 160, 164, 167,
scheduling, 393
scheduling, 305 clock synthesis, 212
simple bit manipulation, 506 clock tree, 116, 151, 154, 160, 162, 173,
speech coding, 468 177, 389, 439
static, 337 clustering, 280, 294, 361, 371, 485, 501
Viterbi, 352 CMOS
application software, 476, 498, 503, 510 circuit, 10, 305, 345
architectural optimizations, 485 gate, 2, 202, 409
architecture NAND, 27, 28, 418, 423, 432, 475
agile, 460 NOR, 159, 475, 484
hardware, 181, 285, 301, 338, 343, scaling, 201
364, 453, 455 technology, 2, 10, 15
reconfigurable, 470 CMOS technology
software, 285, 377 projection, 13
bandwidth optimization, 490, 491 combinational logic, 163, 174
battery, 1, 3, 8, 31, 53, 94, 151, 273, 293, compiler, 181, 189, 285, 289, 291, 295,
298, 335, 368, 370, 386, 392, 410, 435, 508, 515
427, 442, 446, 474, 499 computation accuracy, 181, 187, 197
battery-operated, 297 conditional flip-flop, 91, 110

constraint, 43, 227, 272, 277, 314, 350, embedded DRAM, 51, 82, 91, 117, 481,
352, 363, 380, 393, 395, 409, 454 511
area, 201 embedded systems, 5, 184, 188, 196, 242,
average data rate, 313 274, 298, 302, 320, 332, 476, 488,
average wait time, 384 503, 505
energy, 332, 393 energy band, 23
latency, 364 energy dissipation, 269, 271, 277, 335,
memory size, 486 359, 361, 363, 365, 382, 384, 392, 426
performance, 197, 242, 262, 272, 313, energy estimation, 277, 285, 293
379, 385, 454 energy minimization, 186, 509
power, 3, 5, 21, 26, 31, 34, 43, 309 energy scalability, 335, 338, 343, 354,
quality of service, 197, 198, 337, 393 369
re-use of hardware and software energy-aware computing, 241, 274
components, 242 energy-aware packet scheduling, 306,
stability, 224, 226 310, 312
system cost, 482 energy-efficient, 201, 204, 210, 214, 222,
timing, 3, 114, 242, 243, 311, 428 228, 235, 238, 261, 303, 318, 330,
content-addressable memory, 259, 260 337, 346, 353, 358, 363, 473, 484,
control protocols, 491, 493 488, 494, 500, 510, 515
cross talk, 209, 211, 215, 218 energy-quality scalability, 335
data aggregation, 335, 337, 362, 363 ensemble of point systems, 346
data-link layer, 493 environmental data, 421
datapath width adjustment, 182, 190 FDMA, 355, 356, 359
D-cache, 260, 265, 266, 489 feedback, 35, 66, 143, 159, 174, 177, 221,
delay-locked loop, 201, 212, 237 230, 368, 393, 411, 444
derivative gate-level tools, 436 Fermi potential, 405
design finite state machine, 270
methodologies, 3, 278, 474 fixed time-out, 378
platform-based, 7, 243, 273, 451, 463 flash, 51, 53, 61, 67, 71, 86, 124, 127,
tools, 3, 414, 438 133, 146, 149, 339, 367
design automation, 413, 414, 421, 443 flash memory
discrete doping effect, 33, 45 cell operation, 55
DLL, 212, 223, 224, 227, 229, 232, 238 NAND, 53, 59, 67
double edge-triggered, 151, 154, 157 NOR, 53, 58, 66, 71
double-edge triggered flip-flops, 154, NOR flash memory, 66
164, 174 flat-band voltage, 131
DPM. See dynamic power management flip-flop
DRAM, 24, 33, 44, 51, 74, 75, 77, 82, 88, dynamic power, 174
89, 118, 480, 481 frame length adaptation, 316, 317
cell, 74 frequency scaling, 500
indirect band-to-band tunnelling, 24 gate insulator, 11, 15, 18, 21, 23, 30, 32,
low voltage operation, 83 39, 43, 45
retention time, 44 gate-level tools, 431, 438, 439
dynamic datapath-width adjustment, 181, generator matrix, 380
187 greedy, 378
dynamic power management, 299, 305, Hamming distance, 350, 509
310, 313, 332, 373, 377, 409, 410, 500 hardware, 281
electromigration, 15, 448 profiling, 269
hazard, 389

timing, 470 on-chip, 485


high-impedance drivers, 207, 235 optimization technique, 486
high-performance flip-flops, 155, 159, partitioning, 488, 489
170 SRAM, 5, 31, 51, 106
high-speed, 63, 72, 83, 95, 108, 113, 119, volatile random access, 51
121, 124, 129, 131, 138, 142, 144, memory bandwidth, 490
149, 155, 201, 206, 209, 211, 217, 218, 223, 228, 232, 236, 370, 492, 493
high-speed links, 201, 204, 212, 216, 221, 223
hot electron, 55
I-cache, 260, 265, 272, 286, 489
information-flow management, 500, 502, 511
leakage
  minimization, 387, 398
  power consumption, 182
  reduction, 435
leakage control, 281
leakage current, 11, 15, 20, 23, 30, 34, 68, 71, 80, 86, 92, 96, 101, 106, 154, 162, 167, 176, 183, 186, 231, 281, 298, 300, 304, 335, 366, 373, 399, 400, 402, 405, 409, 411, 415
leakage current control, 373
level shifter, 54, 68, 70
link
  high-speed, 203
  receiver, 203
  transmitter, 203
link design, 201, 208, 214, 215, 219, 222, 228, 233, 235
logic embedding, 151, 155, 171, 175
logic synthesis, 114, 115, 189, 193, 278
low power transceiver, 335, 341
low-power flip-flops, 157
low-swing clock double-edge triggered flip-flop, 165
mapping, 242, 244, 253, 441, 506
Markov chain, 246, 250, 373, 380, 381
Markov decision process
  discrete-time, 380, 384
  continuous-time, 380
Markov process, 246, 383
memory
  DRAM, 10
  embedded DRAM, 5, 52, 84, 118, 480
  flash, 52
  hierarchy, 489
memory cell operation, 58
memory hierarchy, 485, 487, 491, 509
memory systems, 492
micro-architecture, 453, 478, 502
  programmable processors, 479
microsensor networks, 332, 335, 337, 361, 369, 370
mixed-signal, 201, 215, 221, 482, 484
modeling standards, 429
MTCMOS, 48, 402
multihop routing, 335
multiple power supplies, 96, 98
multiple threshold voltages, 101
multiple transistor width, 96, 104
NAND, 28, 51, 59, 60, 64, 67, 70, 86, 155, 162, 164, 389, 406, 418, 423
NAND flash memory, 53, 87
network architectures, 491, 493, 497
network layer, 491, 493, 494, 497
non-driven cell plate line scheme, 79
NOR, 51, 54, 58, 64, 68, 71, 86
NOR flash memory, 53, 57, 71
optimization
  gate-level, 432
  on-the-fly, 269, 270
optimization technique, 278, 380
  architectural, 278
  delay, 491
  memory, 487
  power, 258
  software, 278, 487
parallel links, 201, 204, 208, 213, 215
parallelism, 121, 127, 202, 215, 222, 228, 234, 258, 289, 363, 370, 479, 490
Petri net, 245, 381
phase-locked loop, 201, 341, 368
physical layer, 349, 355, 493
PLL, 212, 223, 225, 229, 232, 237, 239, 341, 368
policy
  greedy, 377
  predictive shut-down, 378
  stochastic, 379
  time-out, 373, 378
  wake-up, 379
policy optimization, 373, 379, 381
power
  average, 5, 121, 186, 256, 266, 437
  density, 43
  dynamic, 2, 10, 153, 201, 389, 402, 415
  instantaneous, 426
  leakage, 31, 41, 169, 182, 281, 387, 398, 409, 425, 435
  maximum, 284, 441
  minimum, 122, 148, 386
  peak, 2, 278, 369, 484
  short-circuit, 2, 10
  static, 7, 10, 167, 341, 416, 437, 498
power analysis, 242, 275, 294, 428, 432, 438, 441
power awareness, 304, 335, 345
power estimation, 149, 258, 264, 269, 276, 279, 295, 443
  hardware, 269
power gating, 373, 401, 407
power management, 241, 268, 273, 295, 302, 310, 315, 320, 332, 374, 376, 379, 384, 409, 410, 476, 500, 510
power measurements, 303, 426, 428
power modeling, 258, 260, 273, 279, 414
power models, 259, 279, 283, 284, 423, 428, 438, 442, 447
power optimization, 181, 187, 278, 288, 440, 446, 486
power reduction, 96, 117, 429
power sensitive design, 413
power tools, 414, 417, 421, 431, 438, 447
  behavior-level, 440
  gate-level, 431
  register transfer-level, 438
  transistor level, 430
power-aware communication, 297, 338
power-aware protocols, 297
power-constrained, 11, 30, 477, 479
power-supply regulator, 201, 221, 228, 238
pre-computation, 373, 397
predictive wake-up, 379
probability
  stable-state, 248
  steady-state, 254, 256
  transition, 246
processor energy model, 335
quality-driven design, 181
radio energy model, 335
radio power management, 297, 310, 313
receiver, 204, 208, 210, 214, 223, 228, 232, 237, 298, 301, 303, 311, 318, 341, 350, 357, 359, 368, 495, 497
resource scaling, 272
SAN. See stochastic automata network
scaling, 9, 15, 21, 26, 31, 38, 45, 49, 87, 95, 118, 125, 136, 141, 148, 201, 202, 217, 227, 235, 239, 260, 270, 299, 306, 314, 332, 344, 368, 395, 407, 411
  frequency, 368
  resource, 270, 271
  voltage, 369, 500
self-boosted programming technique, 61
semiconductor memories, 51, 221, 498
sense amplifier, 84, 85, 88, 234, 281
sensor networks, 297, 302, 324, 334, 342, 349, 365, 370
serial links, 201, 203, 208, 228, 232, 237
short-circuit, 81
short-range radios, 302, 303, 331
signal processing, 122, 148, 181, 198, 238, 294, 303, 339, 363, 370, 393, 410, 480
simulation tools, 293
single-chip voice recorder, 482
single-ended, 159, 160, 206, 207, 208
slew rate, 143, 206, 209, 228
slew-rate control, 206
small-swing, 151, 157, 161, 165, 174, 176
soft-core processor, 181, 188, 190
SOI, 12, 20, 36, 41, 46, 48
SRAM
  cell, 31, 34, 145
  low voltage, 106
  programmable FPGA, 462
  static power, 498
static
  power, 106
static logic, 343
static throttling methods, 272
stationary, 263, 311
statistical power reduction flip-flops, 174
statistical power saving, 151, 160
stochastic, 129, 382
  automata network, 241, 243, 245
  characteristic of workload, 385
  model of power-managed system, 373, 384
  model of power-managed service providers, 380
  model of power-managed service requestors, 380
  Petri net, 381
  power management, 381
  power management policy, 385
  simulation, 26
sub-clock period symbols, 217
substrate bias, 58, 91, 94, 95, 109
subthreshold current, 169
supply voltage, 2, 20, 30, 53, 62, 70, 81, 92, 103, 121, 135, 142, 154, 162, 202, 217, 222, 231, 270, 304, 332, 343, 364, 387, 394, 401, 411, 422, 443, 455, 462, 470, 495, 500
switching activity, 2, 31, 202
  clock driver, 390
  flip-flop, 389
  hardware block, 279, 466
  input data, 158, 176
system level design
  application software, 473
system partitioning, 365
system software, 499, 502, 510
TDMA, 355, 356, 359
technology data, 421
threshold
  multiple threshold voltage, 4
  multi-threshold CMOS, 402
threshold voltage, 4, 12, 16, 24
  distribution, 56
  dynamically varying, 406
  flash memory, 55, 59
  flat-band-voltage, 131
  floating-gate transistor, 58
  leakage current, 399
  limitation, 86
  multiple threshold voltages, 101
  multi-threshold, 96
  power efficiency, 65
  scale, 29, 231, 387, 399
  scaling, 20
  small gate length, 132
  SRAM, 106
  subthreshold slope, 92
  temperature, 44
  threshold voltage roll-off, 20
  variable threshold, 5
threshold voltage, 91, 99
timing recovery, 212
TLB, 260, 287, 512
topology management, 319, 320, 321, 324, 328, 329, 332
transistor level tools, 419, 431, 438
transistor mismatch, 136, 149
transition matrix, 245, 380
transmitter, 205, 212, 217, 223, 230, 234, 236, 301, 317, 341, 349, 357, 360, 368, 495, 497
transport layer, 491, 494
tunnelling, 17, 20, 24, 30, 34, 41, 45, 86
VDSM, 151, 169, 174
velocity saturation, 99
video decoder, 182, 195, 249
voltage scaling, 15, 16, 20, 91, 95, 120, 137, 140, 148, 201, 272, 298, 304, 335, 343, 364, 368, 398, 411, 502
wireless communications, 297, 298, 299, 300, 302, 304, 317, 332
