by
Subbarao Palacharla
A dissertation submitted in partial fulfillment of
the requirements for the degree of
Doctor of Philosophy
(Computer Sciences)
at the
UNIVERSITY OF WISCONSIN-MADISON
1998
Abstract
The performance trade-off between hardware complexity and clock speed in the design
of superscalar microarchitectures is first investigated. Using the results of this trade-off
analysis, the thesis proposes and evaluates two new superscalar microarchitectures
designed with the goal of achieving high performance by reducing complexity.
This thesis takes a step towards quantifying the complexity of superscalar microarchitectures.
First, a generic superscalar pipeline is defined. Then the specific areas of register
renaming, instruction window wakeup, instruction window selection, register file access,
and operand bypassing are analyzed. Each is modeled and Spice simulated for three different
feature sizes representing past, present, and future technologies. Performance results
and complexity trends are expressed in terms of issue width and window size. Results
show that instruction window logic and operand bypass logic are likely to be the most crit-
ical in the future.
Following the complexity analysis, we study a family of superscalar microarchitectures
called the dependence-based microarchitectures. These microarchitectures exploit natural
dependences occurring in programs to reduce the complexity of window logic and oper-
and bypass logic. Simulation results show that dependence-based superscalar microarchi-
tectures are capable of extracting similar levels of parallelism as a conventional
microarchitecture while facilitating a faster clock.
Finally, we propose and evaluate the integer-decoupled microarchitecture that improves
the performance of integer programs by minimally adding to a conventional microarchitecture.
Floating-point units in the conventional microarchitecture are augmented to perform
simple integer operations and the resulting floating-point subsystem is used to
support some of the computation in integer programs. Simulation results are presented that
show modest speedups for a 4-way processor. The speedups are attractive, however, con-
sidering that the proposed microarchitecture requires little additional hardware.
Acknowledgments
First and foremost, I thank my parents and family for encouragement and support during
the seemingly endless stay in graduate school. I thank my dad for nudging me towards
graduate school and research.
I am indebted to Jim Smith, my advisor, for taking me as his student at a crucial juncture
in my graduate school career. I thank him for providing direction and for sharing his ideas.
Most of all, I enjoyed his style of loosely-coupled advising. I also thank him for gladly
answering my questions during countless walk-in meetings.
I also owe a lot to Norm Jouppi for helping me technically with the core of this disserta-
tion. He patiently answered my questions, some stupid ones too, about VLSI circuits. I
enjoyed many informative discussions about circuits and computer architecture with him.
His advice and help made this thesis possible.
I am especially grateful to Guri Sohi for providing me with an office and computing
facilities in Computer Sciences. I thank him for serving as a reader and for his critical
comments on the thesis.
I thank David Wood, Jim Goodman, and Charles Kime for serving on my committee. I
would like to especially thank David Wood for serving as a reader while on sabbatical and
for making numerous useful comments that greatly improved the presentation of the the-
sis.
Over the past six years, I have had the privilege of technically interacting with Jim
Goodman, Mark Hill, Rick Kessler, Norm Jouppi, Jim Smith, Guri Sohi, and David Wood.
I thank them for teaching me most of what I know about computer architecture.
This dissertation has benefited from the work of other graduate students. Subramanya
Sastry implemented the compiler support for a part of the dissertation research. Todd Austin
developed the toolset on which the simulators used in this dissertation are based.
Alain Kägi always made time for helping me with Framemaker. I thank all of them.
I thank Scott Breach, Douglas Burger, Satish Chandra, Babak Falsafi, Alain Kägi, and T.
N. Vijaykumar for their friendship and company. Their camaraderie made life enjoyable
and helped insulate me from occasional low points in life. Special thanks to Satish for
the innumerable trips to State Street. Thanks to Amir Roth for fun discussions about any-
thing and everything during the last year of my graduate studies. Outside of work, I thank
Ambuj Shatdal, Francis Valiyaveetil, and Jignesh Patel for their company.
Finally, I would like to thank the agencies that funded my graduate studies. My work
was supported in part by Wisconsin Alumni Research Foundation, NSF grant MIP-9505853,
University of Wisconsin Graduate School, the U.S. Army Intelligence Center
and Fort Huachuca under contract DABT63-95-C-0127 and ARPA order no. D346, Cray
Research Inc., and Digital Equipment Corporation - Western Research Laboratory.
Table of Contents
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . i
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
Chapter 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Historical Perspective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 The Conventional Microarchitecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4 Thesis Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4.1 Quantifying the Complexity of Superscalar Microarchitectures . . . . . . . . . 9
1.4.2 Dependence-based Superscalar Microarchitectures . . . . . . . . . . . . . . . . . 10
1.4.3 Integer-decoupled Microarchitecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.5 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Chapter 2. Quantifying the Complexity of Superscalar Microarchitectures . . . . . . . . . . 13
2.1 Sources of Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.1.1 Basic Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.1.2 Current Implementations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2.1 Caveats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2.2 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3 Technology Trends . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3.1 Logic Delays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3.2 Wire Delays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.4 Complexity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.4.1 Register Rename Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.4.1.1 Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4.1.2 Delay Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.4.1.3 Spice Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.4.1.4 Model Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.4.2 Window Wakeup Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.4.2.1 Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.4.2.2 Delay Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.4.2.3 Spice Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.4.2.4 Model Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
2.4.3 Window Selection Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
2.4.3.1 Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.4.3.2 Delay Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
2.4.3.3 Spice Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
2.4.3.4 Model Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
2.4.4 Register file Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
2.4.4.1 Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
2.4.4.2 Delay Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
2.4.4.3 Spice Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
2.4.4.4 Model Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
2.4.5 Data bypass logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
2.4.5.1 Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
2.4.5.2 Delay Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
2.4.5.3 Spice Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
2.4.5.4 Model Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
2.4.5.5 Alternative Layouts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
2.5 Pipelining Issues and Overall Delay Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
2.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
2.7 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Chapter 3. Dependence-based Superscalar Microarchitectures . . . . . . . . . . . . . . . . . . . 85
3.1 Concept . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
3.2 Dependence-based Microarchitectures : An Example . . . . . . . . . . . . . . . . . . . . 89
3.2.1 Performance of the Fifo-based Microarchitecture . . . . . . . . . . . . . . . . . . . 92
3.2.2 Complexity Analysis of the Fifo-based Microarchitecture . . . . . . . . . . . . 94
3.2.3 Clustering the Fifo-based Microarchitecture . . . . . . . . . . . . . . . . . . . . . . . 96
3.2.4 Overall Performance of the Clustered Fifo-based Microarchitecture . . . . 97
3.2.5 Effect of Scaling Instruction and Data Cache Miss Latency . . . . . . . . . . . 99
3.3 Other Dependence-based Microarchitectures . . . . . . . . . . . . . . . . . . . . . . . . . . 100
3.3.1 Single Window, Multiple Execution Clusters, Execution-driven Steering 100
3.3.2 Multiple windows, Dispatch-driven Steering . . . . . . . . . . . . . . . . . . . . . 101
3.3.3 Complexity of Steering Policies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
3.4 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
3.4.1 Performance Relative to an Ideal Superscalar . . . . . . . . . . . . . . . . . . . . . 107
3.4.2 Effect of Increasing Number of Clusters . . . . . . . . . . . . . . . . . . . . . . . . . 108
3.4.3 Effect of Increasing Inter-cluster Latency . . . . . . . . . . . . . . . . . . . . . . . . 110
3.4.4 Inter-cluster Bypass Frequency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
3.4.5 Comparing against In-order Distributed Reservation Stations . . . . . . . . 112
3.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
3.6 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
Chapter 4. Integer-Decoupled Microarchitecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
4.1 Concept . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
4.2 Changes to the Conventional Microarchitecture . . . . . . . . . . . . . . . . . . . . . . . . 122
4.3 Partitioning the Program . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
4.4 Basic Partitioning Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
4.4.1 Terminology and Data Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
4.4.2 Partitioning Conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
4.4.3 Partitioning Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
4.5 Advanced Partitioning Schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
4.5.1 Limitations of the Basic Partitioning Scheme . . . . . . . . . . . . . . . . . . . . . 131
4.5.2 Cost Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
4.5.3 Algorithm for Introducing Copies and Duplicating Code . . . . . . . . . . . . 134
4.6 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
4.6.1 Evaluation Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
4.6.2 Performance Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
4.7 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
4.8 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
Chapter 5. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
5.1 Thesis Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
5.2 Future directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
5.2.1 Quantifying the Complexity of Superscalar Microarchitectures . . . . . . . 148
5.2.2 Dependence-based Superscalar Microarchitectures . . . . . . . . . . . . . . . . 148
5.2.3 Integer-decoupled Microarchitecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
Appendix A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
A.1 Technology Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
A.2 Delay Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
Appendix B . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
B.1 Register Rename Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
B.2 Window Wakeup Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
B.3 Window Selection Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
B.4 Register File Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
B.5 Data Bypass Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
List of Figures
Figure 1-1. A typical superscalar microarchitecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
Figure 1-2. Time line showing evolution of superscalar processors . . . . . . . . . . . . . . . . . 5
Figure 2-1. Baseline superscalar model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Figure 2-2. Reservation stations-based superscalar model. . . . . . . . . . . . . . . . . . . . . . . . 18
Figure 2-3. Register rename logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Figure 2-4. Renaming example showing dependency checking . . . . . . . . . . . . . . . . . . . 28
Figure 2-5. Rename map table. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Figure 2-6. Decoder structure and equivalent circuit . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Figure 2-7. Wordline structure and equivalent circuit . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
Figure 2-8. Bitline structure and equivalent circuit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Figure 2-9. Rename delay versus issue width . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
Figure 2-10. Model delay results for rename logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
Figure 2-11. Window wakeup logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
Figure 2-12. CAM cell in wakeup logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
Figure 2-13. Tag drive structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Figure 2-14. Tag match structure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Figure 2-15. Logic for ORing individual match signals . . . . . . . . . . . . . . . . . . . . . . . . . 45
Figure 2-16. Wakeup logic delay versus window size . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Figure 2-17. Wakeup logic delay. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Figure 2-18. Wakeup delay versus feature size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Figure 2-19. Model delay results for wakeup logic. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Figure 2-20. Selection logic. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
Figure 2-21. Handling multiple functional units . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
Figure 2-22. Arbiter Logic. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
Figure 2-23. Selection delay versus window size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Figure 2-24. Model delay results for selection logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Figure 2-25. Register file logic delay. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Figure 2-26. Breakup of register file delay . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Figure 2-27. Model delay results for register file logic.. . . . . . . . . . . . . . . . . . . . . . . . . . 66
Figure 2-28. Bypass logic. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Figure 2-29. Bypass logic equivalent circuit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Figure 2-30. Inserting buffers in the result wires . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
Figure 2-31. Bypass logic delays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
Figure 2-32. Model delay results for bypass logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
Figure 2-33. Alternative layouts for bypassing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
Figure 2-34. Pipelining wakeup and select. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
Figure 2-35. Effect of pipelining on IPC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
Figure 2-36. Overall delay results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
Figure 3-1. Dependence-based superscalar microarchitecture. . . . . . . . . . . . . . . . . . . . . 87
Figure 3-2. Fifo-based microarchitecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
Figure 3-3. Instruction steering example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
Figure 3-4. Performance of single-cluster fifo-based microarchitecture . . . . . . . . . . . . . 92
Figure 3-5. Fifo utilization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
Figure 3-6. Fifo-based microarchitecture with two clusters. . . . . . . . . . . . . . . . . . . . . . . 96
Figure 3-7. Performance of the clustered fifo-based microarchitecture. . . . . . . . . . . . . . 97
Figure 3-8. Potential improvements with the fifo-based microarchitecture. . . . . . . . . . . 98
Figure 3-9. Effect of Scaling Instruction and Data Cache Miss Latency. . . . . . . . . . . . . 99
Figure 3-10. Other dependence-based microarchitectures . . . . . . . . . . . . . . . . . . . . . . . 101
Figure 3-11. Fifo steering hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
Figure 3-12. Performance of dependence-based superscalar microarchitectures . . . . . 107
Figure 3-13. Effect of increasing number of clusters. . . . . . . . . . . . . . . . . . . . . . . . . . . 108
Figure 3-14. Effect of increasing inter-cluster latency. . . . . . . . . . . . . . . . . . . . . . . . . . 110
Figure 3-15. Inter-cluster bypass frequency. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
Figure 3-16. Comparing against in-order distributed reservation stations. . . . . . . . . . . 112
Figure 4-1. An example program fragment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
Figure 4-2. Code partitioning for example fragment . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
Figure 4-3. Program slices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
Figure 4-4. Static dependence graph for example program. . . . . . . . . . . . . . . . . . . . . . 128
Figure 4-5. Partitioning with copies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
Figure 4-6. Partitioning with code duplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
Figure 4-7. Percentage of instructions assigned to Comp . . . . . . . . . . . . . . . . . . . . . . . 138
Figure 4-8. Speedups on the 4-way machine. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
Figure 4-9. Speedups on the 8-way machine. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
Figure 4-10. Instruction mix of the Comp partition. . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
List of Tables
Table 2.1: Terminology. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Table 2.2: Fan-in of decoder gates. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
Table 3.1: Delay of reservation table in 0.18µm technology. . . . . . . . . . . . . . . . . . . . . . . 95
Table 3.2: Baseline simulation model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
Table 3.3: Various microarchitectures simulated. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
Table 4.1: Extra opcodes supported in the Comp subsystem. . . . . . . . . . . . . . . . . . . . 123
Table 4.2: Machine parameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
Table 4.3: Benchmark programs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
Table A.1: Spice parameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
Table A.2: Metal resistance and capacitance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
Table A.3: Break down of rename delay. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
Table A.4: Break down of window wakeup delay for 0.8µm technology. . . . . . . . . . . 163
Table A.5: Break down of window wakeup delay for 0.35µm technology. . . . . . . . . . 164
Table A.6: Break down of window wakeup delay for 0.18µm technology. . . . . . . . . . 165
Table A.7: Break down of selection delay. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
Table A.8: Overall delay results for 0.8µm technology. . . . . . . . . . . . . . . . . . . . . . . . 167
Table A.9: Overall delay results for 0.35µm technology. . . . . . . . . . . . . . . . . . . . . . . 167
Table A.10: Overall delay results for 0.18µm technology. . . . . . . . . . . . . . . . . . . . . . 167
Table B.1: Constants in decoder delay equation for rename logic. . . . . . . . . . . . . . . . 169
Table B.2: Constants in wordline delay equation for rename logic. . . . . . . . . . . . . . . . 170
Table B.3: Constants in bitline delay equation for rename logic. . . . . . . . . . . . . . . . . . 170
Table B.4: Constants in total delay equation for rename logic. . . . . . . . . . . . . . . . . . . 171
Table B.5: Constants in tag drive delay equation for wakeup logic. . . . . . . . . . . . . . . 171
Table B.6: Constants in tag match delay equation for wakeup logic. . . . . . . . . . . . . . 172
Table B.7: Constants in match OR delay equation for wakeup logic. . . . . . . . . . . . . . 172
Table B.8: Constants in total delay equation for wakeup logic. . . . . . . . . . . . . . . . . . . 173
Table B.9: Constants in total delay equation for selection logic. . . . . . . . . . . . . . . . . . 174
Table B.10: Constants in decoder delay equation for register file logic. . . . . . . . . . . . 174
Table B.11: Constants in wordline delay equation for register file logic. . . . . . . . . . . . 175
Table B.12: Constants in bitline delay equation for register file logic. . . . . . . . . . . . . . 175
Table B.13: Constants in total delay equation for register file logic. . . . . . . . . . . . . . . 176
Table B.14: Constants in total delay equation for data bypass logic. . . . . . . . . . . . . . . 176
Chapter 1
Introduction
1.1 Motivation
Over the past decade superscalar microprocessors have become a source of tremendous
computing power. They form the core of a wide spectrum of high-performance computer
systems ranging from desktop computers to small-scale parallel servers to massively-par-
allel systems. To satisfy the ever-growing need for higher levels of computing power, com-
puter architects need to investigate techniques that continue improving the performance of
superscalar microprocessors while considering both changing technology and applica-
tions.
Superscalar microarchitectures [Joh91, SS95], on which superscalar microprocessors
are based, deliver high performance by executing multiple instructions in parallel every
cycle. Hardware is used to detect and execute parallel instructions. This technique of
exploiting fine-grain parallelism at the instruction level to improve performance is commonly
referred to as instruction-level parallelism. The maximum number of instructions
processed in parallel, also known as the width of the microarchitecture, is typically four
for the fastest microprocessors [Gwe96a, Kum96] available today. A typical superscalar
microarchitecture, illustrated in Figure 1-1, operates as follows. Multiple instructions are
fetched from the instruction cache every cycle. The instructions are then decoded, checked
for dependences, renamed, and deposited in an instruction window. The instructions wait
in the instruction window for their operands and functional units to become available.
Hardware continuously monitors the dependences between instructions in the window and
selects appropriate instructions for parallel execution. The overall hardware apparatus
responsible for creating the window, monitoring dependences between instructions in the
window, selecting instructions for execution from the window, and providing data oper-
ands to the instructions, henceforth collectively referred to simply as issue logic, is one of
the most performance-critical components in a superscalar processor. The issue logic
largely determines the amount of instruction-level parallelism that can be extracted.
Hence, optimizing this logic is of paramount importance.
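The wakeup-and-select behavior of the issue logic described above can be illustrated with a small Python sketch. This is a toy model under stated assumptions, not the model developed in this thesis: instructions are assumed to be already renamed (so only true dependences remain), every instruction is assumed to complete in one cycle, functional units are assumed to be unconstrained, and the names `Instr` and `issue_schedule` as well as the example registers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    name: str     # label used only for reporting
    dest: str     # renamed destination register (assumed unique)
    srcs: tuple   # source registers the instruction waits on


def issue_schedule(window, width, ready_regs):
    """Return one list of issued instruction names per cycle.

    window     -- instructions in program order (oldest first)
    width      -- maximum number of instructions selected per cycle
    ready_regs -- registers whose values are available at the start
    """
    ready = set(ready_regs)
    pending = list(window)
    schedule = []
    while pending:
        # Select: the oldest `width` instructions whose operands are all ready.
        selected = [i for i in pending if all(s in ready for s in i.srcs)][:width]
        if not selected:
            raise ValueError("no instruction can issue: an operand is never produced")
        # Wakeup: results become visible to waiting instructions next cycle.
        for instr in selected:
            ready.add(instr.dest)
        issued = {id(instr) for instr in selected}
        pending = [i for i in pending if id(i) not in issued]
        schedule.append([instr.name for instr in selected])
    return schedule


# Hypothetical four-instruction window on a 2-wide machine.
window = [
    Instr("i1", "r1", ("a",)),
    Instr("i2", "r2", ("r1",)),
    Instr("i3", "r3", ("b",)),
    Instr("i4", "r4", ("r2", "r3")),
]
schedule = issue_schedule(window, width=2, ready_regs={"a", "b"})
ipc = len(window) / len(schedule)
print(schedule, ipc)
```

In this example the dependence chain i1, i2, i4 forces three cycles even though the machine is 2-wide, so the sustained rate is below the width. Widening the machine would not help here; only the dependences in the window limit the schedule.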
The net performance of a superscalar microarchitecture is directly proportional to the
product Instructions Per Cycle × Clock Frequency. Instructions Per Cycle or IPC is the
sustained number of instructions executed in parallel every cycle. IPC depends on a number
of factors including the inherent parallelism in the program, the width of the microarchitecture,
the size of the instruction window, and other characteristics of the scheme used
for extracting parallelism. Clock Frequency is the speed at which the microarchitecture is
clocked and is determined by the delays associated with the significant critical paths in the
microarchitecture.
Figure 1-1. A typical superscalar microarchitecture. [Block diagram; labeled components:
I-Cache, Fetch, Rename, Integer Window, Integer Regs, FP Window, FP Regs, Load/Store,
ALU/Branch, ALU, FP Mult, FP Add, Write Buffer (two), TLB, Data Cache.]
For the past decade, the general approach for improving the performance of superscalar
microprocessors has been to build microarchitectures with increasingly complex issue
logic that can boost the IPC factor in the performance equation. The increase in complex-
ity results from a wider microarchitecture, a bigger instruction window, and more complex
issue methods. However, there is a potential problem with continuing this strategy. While
complex issue logic might be able to extract more parallelism, it can easily limit the clock
speed of the microarchitecture. Microarchitectures with more complex issue logic typi-
cally require longer wires and deeper levels of logic to implement, and hence, can require
longer critical paths in the microarchitecture. Thus, there is a danger of squandering the
gains in IPC to a slow clock, resulting in reduced benefits or even no benefit in overall performance.
Furthermore, technology trends suggest that wire delays will increasingly dominate
total delay as feature sizes are reduced. These factors suggest that straightforward
scaling of current microarchitectures for higher IPCs might not be the most appropriate
approach for delivering higher performance in the future. In summary, there is a trade-off
between issue logic complexity, instructions per cycle (IPC), and clock speed that needs to
be carefully examined while designing improved superscalar microarchitectures. This the-
sis examines this trade-off.
The above discussion underscores the need for investigating superscalar microarchitectures
that judiciously use hardware complexity for exploiting significant levels of instruction-level
parallelism while permitting a fast clock. We call such microarchitectures
complexity-effective superscalar microarchitectures. These microarchitectures attempt to
maximize the product of IPC and Clock Frequency rather than push the envelope for each
term separately. This thesis proposes and evaluates two such complexity-effective super-
scalar microarchitectures called dependence-based microarchitectures and integer-decou-
pled microarchitectures.
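
The idea of maximizing the product rather than either term can be illustrated with a minimal Python sketch. The two design points below are hypothetical numbers chosen only for illustration; they are not results from this thesis, and the function name is an assumption of the sketch.

```python
def performance(ipc, clock_mhz):
    """Millions of instructions per second: IPC times clock frequency (MHz)."""
    return ipc * clock_mhz


# Hypothetical design points; the numbers are illustrative, not measurements.
wide_complex = performance(ipc=2.0, clock_mhz=400.0)   # more parallelism, slower clock
narrow_simple = performance(ipc=1.8, clock_mhz=500.0)  # less parallelism, faster clock

if __name__ == "__main__":
    print(f"wide/complex : {wide_complex:.0f} MIPS")
    print(f"narrow/simple: {narrow_simple:.0f} MIPS")
```

With these made-up values the design with the lower IPC delivers the higher product, which is the situation a complexity-effective design aims for.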
It must be mentioned that the complexity of a design can have different, sometimes conflicting,
meanings. To a verification engineer, design A is more complex than design B if
the time taken to verify design A is greater than that for design B. On the other hand, a
logic designer typically measures complexity in terms of the number of gates required to
implement a design. In this thesis, complexity is measured as the delay of the critical path
through a piece of logic, and the longest path through any of the pipeline stages determines
the clock speed. Complexity, as we define it, is largely independent of the number
of gates required or the time to verify the design. Instead, complexity is dependent on a
number of factors that could affect the delay of the critical paths in the design such as the
number of logic stages, the length of wires, the degree of fan-out of a particular signal, and
the number of associative compares performed every cycle.
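
Under this definition, the clock is set by the slowest pipeline stage. A minimal Python sketch of that relationship follows; the stage names and picosecond delays are hypothetical placeholders, not values measured in this thesis.

```python
def critical_stage(stage_delays_ps):
    """Return (stage name, delay) for the slowest pipeline stage."""
    name = max(stage_delays_ps, key=stage_delays_ps.get)
    return name, stage_delays_ps[name]


def max_clock_mhz(stage_delays_ps):
    """Clock frequency permitted by the slowest stage (period in ps -> MHz)."""
    _, period_ps = critical_stage(stage_delays_ps)
    return 1e6 / period_ps


# Hypothetical per-stage critical-path delays in picoseconds.
example_delays = {"rename": 400.0, "wakeup+select": 500.0, "bypass": 450.0}

if __name__ == "__main__":
    print(critical_stage(example_delays), max_clock_mhz(example_delays))
```

Note that neither gate count nor verification effort appears in the calculation; only the longest path matters.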
While designing for complexity-effectiveness is a desirable goal, the question that
immediately arises is: how do we quantify the complexity of a microarchitecture? It is
commonplace to measure the IPC of a new microarchitecture, typically by using simula-
tion. Such simulations count clock cycles and provide IPC in a direct manner. However,
the complexity of a microarchitecture is much more difficult to determine: to be very
accurate, it requires a full implementation in a specific technology. What is very much
needed are fairly straightforward measures of complexity that can be used by microarchi-
tects at a fairly early stage of the design process. Such methods would allow the determi-
nation of complexity-effectiveness. This thesis takes a step in the direction of
characterizing complexity and complexity trends.
1.2 Historical Perspective
This section briefly outlines the evolution of ILP processors, especially superscalar processors,
while highlighting major trends in design trade-offs involving hardware complexity
and performance. Figure 1-2 illustrates the evolution of ILP processors with a time
line.
Pipelining [Kog81] is the most prevalent technique for exploiting instruction-level paral-
lelism. Pipelining enables overlapped execution of multiple instructions by breaking
instruction processing into segments, just like an assembly line. It was first implemented
in the IBM Stretch [Buc62] in 1961. Ever since, pipelining has been adopted by almost all
high-performance designs.
The 1960s saw two pioneering machines that laid the foundation for many of the ILP
techniques in wide use today. These were the CDC 6600 [Tho61,Tho63] and the IBM 360/91
[AST67] machines, delivered in 1964 and 1967 respectively. The CDC 6600 implemented
an impressive repertoire of architectural techniques, especially for its time: a
clean load/store instruction set that enabled efficient pipelining, multiple functional units,
and scoreboarding logic for dynamic scheduling. In the IBM 360/91 floating-point subsystem,
the designers implemented a more sophisticated issuing scheme known as Tomasulo's
algorithm [Tom67] after its inventor. The issuing schemes of most current
superscalar microprocessors can be viewed as variants of Tomasulo's scheme. Even
though the two designs implemented out-of-order execution, they were both single-issue
machines. Out-of-order execution was used to overlap execution of long-latency operations,
tolerate slow memory accesses, and, in the case of the 360/91, mitigate the performance
drawbacks of having few (8) floating-point registers.
Figure 1-2. Time line showing evolution of superscalar processors. (Years marked: 1961, 1964, 1967, 1971, 1980, 1990, 1995. Machines shown: IBM Stretch, CDC 6600, IBM 360/91, IBM ACS, Astronautics ZS-1, Wisconsin PIPE, IBM America, IBM Power1, DEC 21064, HP 7100, Intel Pentium, MIPS R10000, Intel Pentium Pro, DEC 21264.)
Soon after, both IBM and CDC reverted to simpler in-order issue, pipelined
machines with a fast clock. The follow-on machines, the CDC 7600 and the IBM 360/85,
issued instructions strictly in order. The exact reasons for this reversal are not known, but
issues like the difficulty of debugging complex issue methods and the extra hardware cost
are likely considerations on which the decision was based. Also, the use of a cache in the
IBM 360/85 to tolerate memory latency probably made out-of-order execution less attractive.
Two decades later, mushrooming transistor budgets, advanced CAD tools, and the
market for high performance would trigger the resurgence of 6600 and 360/91-like
schemes in the context of superscalar microprocessors.
The 1970s was not an eventful decade for ILP processors. All commercial machines still
had a peak fetch rate of one instruction per cycle. However, during this time, some of the
initial research in the area of multiple-instruction issue [TF70,RF72,Sch71] was carried
out. Schorr describes an exploratory design [Sch71] capable of fetching, decoding, and
executing multiple instructions every cycle. The design, later to be known as the IBM ACS
(Advanced Computer System), was partitioned into the index unit that performed address-
ing operations and the arithmetic unit that executed arithmetic instructions. The arithmetic
unit had a window of eight instructions out of which three instructions could be issued for
execution every cycle. Unfortunately, the project was cancelled due to the incompatibility
of the ISA with the S/360 ISA and other problems.
The late 1970s saw the emergence of a new paradigm for ILP called VLIW (Very
Long Instruction Word) that grew out of early microcode machines [Wil51] and systems
built by Floating Point Systems [Cha81]. VLIWs rely on the compiler to pack independent
operations into a long instruction word which are then executed on multiple, independent
functional units. The arguments in favor of VLIW are two-fold. First, since the compiler
has a larger scope than the hardware to look for independent operations, VLIWs should be
able to exploit more parallelism than superscalars. Second, since complex issue hardware
is no longer required, VLIW processors can be clocked much faster than superscalar processors.
However, even though a few commercial VLIW processors were built, the paradigm
has not gained widespread acceptance. There are a number of reasons. First, to
match hardware techniques, the paradigm requires sophisticated compiler technology that
implements advanced techniques like software pipelining, global scheduling to move
instructions across branches, trace scheduling [Fis81], and memory disambiguation.
While advanced VLIW compilers [Ell85] that focused on floating-point codes have been
developed, it is not clear how well they perform on integer code where branches occur fre-
quently and memory disambiguation is hard. Second, exposing hardware details to the
compiler results in binaries that might not be portable across implementations. Third, the
sophisticated transformations tend to result in increases in code size that can potentially
degrade overall performance.
The lack of ILP innovation continued into the early 1980s. This was the period when
most microprocessor designers were busy implementing RISC concepts [PS81] in the
form of simple pipelining, and new ILP techniques did not receive much attention. How-
ever, the second half of the 1980s saw renewed ILP activity both in the superscalar and
VLIW areas. Commercial implementations of the VLIW concept, Trace [CNO+88]
by Multiflow and Cydra 5 [RYYT89] by Cydrome, were delivered during this time.
However, these implementations had limited success in penetrating commercial markets.
At the same time, three experimental superscalar prototype efforts [S+87,GHL+85,Gro90]
were underway. These were the Astronautics ZS-1, the Wisconsin PIPE, and the IBM
America machines. All three implemented a limited form of multiple issue:
integer instructions, including memory-access-related instructions, were issued in parallel
with floating-point instructions. The ZS-1 and the PIPE used architectural queues to communicate
values between the two classes of instructions. The America design used register
renaming to achieve the same effect. All the designs still used in-order issue to execute
instructions within each class. This simplified issue logic while allowing a limited form of
out-of-order execution.
The early 1990s saw a number of superscalar implementations [KM89,D+92,K+93,Hsu94]:
Intel i860, DEC 21064, HP 7100, MIPS R8000, and others. All of
them, with the exception of the Power1, were simple in-order implementations that
achieved multiple issue by executing instructions of different types (load/store, branch,
floating-point) in parallel. The IBM Power1 [Gro90], based on the earlier America design,
implemented register renaming and sophisticated instruction fetch mechanisms. Other
vendors continued on the path of simple in-order implementations with a faster clock. This
gave rise to the "speed demons" (simple implementations with a fast clock) versus "brainiacs"
(complex implementations with a slow clock) controversy [Gwe93].
The mid 1990s saw some convergence between the two camps. Almost all vendors
moved towards designs implementing complex out-of-order microarchitectures based on
the 6600 and 360/91 schemes as well as ideas explored in academia
[SP88,Soh90,HP86,DT92,YP92]. At the time of the writing of this thesis, every major
microprocessor vendor has a product implementing sophisticated dynamic scheduling.
In 1996, Digital Equipment Corporation, long considered to be the bastion of the speed
demons, announced plans for a product (DEC 21264 [Gwe96a]) implementing an out-of-
order microarchitecture with a relatively fast clock (600 MHz). An interesting feature that
stands out in this design is the microarchitectural changes employed to facilitate a fast
clock. The integer subsystem is partitioned into two clusters. Instructions are steered from
a central window to the clusters. Each cluster has its own copy of the register file. In addition
to reducing the number of register file ports, clustering also makes possible fast
bypassing between units in the same cluster. These features are described in more detail in
Chapter 3. The research presented in this thesis has been highly influenced by this design.
In summary, the superscalar approach^1 has evolved over the years into the mainstream of
processor implementations and each generation of designers had to deal with the trade-off
between hardware complexity and performance.
1. There have been other ILP paradigms, some very successful in their own niche market, that have
not been touched upon in this section. Some of these paradigms are vectors [Rus78], superpipelining
[JW89], autotasking [ABHS89], multiprocessing [FJD80], and dataflow [DM74].
1.3 The Conventional Microarchitecture
As discussed earlier, current superscalar processors, like the MIPS R10000 [Yea96] and
the DEC 21264 [Gwe96a], are typically based on the microarchitecture shown in
Figure 1-1. The issue and execution resources in the machine are partitioned into integer
and floating-point subsystems. The integer subsystem contains a number of load/store,
branch, and functional units that operate on integer operands. The floating-point subsystem
is similar to the integer subsystem except it does not contain load/store units, and it
operates on floating-point operands. Instruction windows in each subsystem buffer
instructions and implement dynamic scheduling as discussed earlier.
The microarchitecture presented in Figure 1-1 will be referred to as the conventional
microarchitecture throughout the rest of this thesis. It will be used as a baseline for perfor-
mance comparisons.
1.4 Thesis Contributions
1.4.1 Quantifying the Complexity of Superscalar Microarchitectures
The main contribution of this thesis is the development of simple models that both quan-
tify the complexity of superscalar microarchitectures and identify complexity trends. Mea-
surement of implementation complexity of microarchitectural features is going to be
increasingly crucial for computer architects to understand and master. While much work
remains to be done in this area, the work presented in this thesis is an important starting
point.
The structures in a baseline superscalar microarchitecture whose complexity grows with
increasing instruction-level parallelism are identified and analyzed. Each is modeled and
Spice simulated for three different feature sizes representing past, present, and future tech-
nologies. Simple analytical models are developed that quantify the delay of these struc-
tures in terms of microarchitectural parameters of window size and issue width. The
impact of technology trends towards smaller feature sizes is studied. In particular, the
impact of poor scaling of wire delays in future technologies is analyzed.
In addition to delays, we study the performance effects of pipelining critical structures.
Even if the delay of a structure is relatively large, it may not increase the complexity of the
design because the structure's operation can be spread over multiple pipestages. Our analysis
identifies structures that are more performance critical. The operation of these structures
should be accommodated within a single cycle to avoid significant degradation in
IPCs achieved, especially for programs with limited parallelism.
Our analysis shows that the issue window logic and data bypass logic are going to be the
most critical structures in future. The delay of the issue window logic increases at least linearly
with both issue width and window size. The functioning of this logic involves broadcasting
of multiple tags on long wires spanning the window, an operation that does not
scale well in future technologies. Furthermore, the delay of the window logic must fit in a
pipestage to avoid performance degradation. Hence, this logic can be a key limiter of
clock speed as we move towards wider issue widths, large window sizes, and advanced
technologies in which wire delays dominate total delay. Another structure that can potentially
limit clock speed, especially in future technologies, is the data bypass logic. The
result wires that are used to bypass operand values increase in length as the number of
functional units is increased. This results in a quadratic dependence of the bypass delay on
issue width. Utilizing buffers helps mitigate the problem to an extent, but a linear increase
in delay with issue width still persists. Just like the window logic, data bypass logic must
also complete within a single cycle for performance reasons. Hence, bypass delays could
ultimately become significant and force architects to consider more decentralized organizations.
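
The quadratic dependence of bypass delay on issue width can be seen from a first-order sketch. The Python below assumes an unbuffered result wire whose length grows linearly with the number of functional units and uses the textbook distributed-RC estimate of half the product of total resistance and total capacitance. The parameter values are hypothetical and in arbitrary units; this is not the delay model developed in Chapter 2.

```python
def bypass_wire_delay(issue_width, unit_height=1.0, r_per_len=1.0, c_per_len=1.0):
    """First-order distributed-RC delay of a result wire spanning all functional units.

    All parameter values are hypothetical and in arbitrary units.
    """
    length = issue_width * unit_height      # wire length grows with the number of units
    total_r = r_per_len * length            # total wire resistance
    total_c = c_per_len * length            # total wire capacitance
    return 0.5 * total_r * total_c          # proportional to length squared


if __name__ == "__main__":
    for width in (2, 4, 8):
        print(width, bypass_wire_delay(width))
```

In this sketch, doubling the issue width from 4 to 8 quadruples the wire delay.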
1.4.2 Dependence-based Superscalar Microarchitectures
This thesis studies a new family of complexity-effective microarchitectures called
dependence-based superscalar microarchitectures that address two major sources of complexity,
window logic and data bypass logic, in conventional microarchitectures.
Dependence-based microarchitectures use two main techniques to achieve the dual goals
of high IPC and a fast clock. First, the machine is partitioned into multiple clusters each of
which contains a slice of the instruction window and execution resources of the whole pro-
cessor. This enables high-speed clocking of the clusters since the narrow issue width and
the small instruction window of each cluster keep critical delays small. The second technique
involves intelligent steering of instructions to the multiple clusters so that the full
width of the machine is utilized while minimizing the performance degradation due to
slow inter-cluster communication.
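
To make the steering idea concrete, here is a minimal Python sketch of one possible dependence-aware heuristic. It is an illustrative assumption, not one of the heuristics evaluated in Chapter 3: an instruction goes to a cluster that already holds the producer of one of its source registers, and otherwise to the least-loaded cluster. The `Instr` type and `steer` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    dest: str           # destination register name, "" if none
    srcs: tuple = ()    # source register names


def steer(instrs, num_clusters=2):
    """Return a cluster index for each instruction, in program order."""
    producer_cluster = {}           # register -> cluster of its latest producer
    load = [0] * num_clusters       # instructions steered to each cluster so far
    placement = []
    for ins in instrs:
        # Clusters that already hold a producer of one of the sources.
        candidates = [producer_cluster[s] for s in ins.srcs if s in producer_cluster]
        if candidates:
            cluster = min(candidates, key=lambda c: load[c])
        else:
            # No known producer: balance load to use the full machine width.
            cluster = min(range(num_clusters), key=lambda c: load[c])
        load[cluster] += 1
        if ins.dest:
            producer_cluster[ins.dest] = cluster
        placement.append(cluster)
    return placement


example = [
    Instr("r1"), Instr("r2"),
    Instr("r3", ("r1",)), Instr("r4", ("r2",)),
    Instr("r5", ("r3", "r4")),
]
```

In the example, the two independent chains land in different clusters, and only the final instruction, which joins them, needs a value from the other cluster.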
A number of design alternatives and steering heuristics for dependence-based microar-
chitectures are proposed and evaluated using simulations. Among the designs presented,
one that is particularly attractive is what we call the fifo-based microarchitecture. This
microarchitecture implements the instruction window as a collection of a small number of
fifos and steers dependent chains of instructions to the same fifo. Simulations show little
slowdown as compared with a completely flexible issue window when performance is
measured in clock cycles. Furthermore, because only instructions at fifo heads need to be
awakened and selected, issue logic is simplified and the clock cycle is faster; consequently,
overall performance is improved. For example, our results show that, due to the
clock speed advantage, the overall performance of a 2X4-way^1 fifo-based microarchitecture
is 14% higher than that of a typical 8-way superscalar even though the proposed
microarchitecture degrades IPC performance by 8% relative to the typical microarchitecture.
By grouping dependent instructions together, the fifo-based microarchitecture also
helps minimize the performance degradation due to slow bypasses in future wide-issue
machines.
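
If overall performance is taken to be exactly the product of IPC and clock frequency, the two percentages quoted above imply a clock-speed ratio. The short Python check below is only that arithmetic, under that assumption; the function name is ours.

```python
def implied_clock_ratio(perf_ratio, ipc_ratio):
    """Clock-frequency ratio implied by performance = IPC x clock frequency."""
    return perf_ratio / ipc_ratio


# 14% higher overall performance with 8% lower IPC (figures quoted in the text).
ratio = implied_clock_ratio(perf_ratio=1.14, ipc_ratio=0.92)

if __name__ == "__main__":
    print(f"implied clock ratio: {ratio:.3f}")
```

Under this assumption the implied clock ratio is roughly 1.24, that is, the clock advantage has to more than offset the IPC loss.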
1. An 8-way microarchitecture comprising two clusters, each consisting of four fifos feeding four
functional units.
1.4.3 Integer-decoupled Microarchitecture
This thesis proposes another complexity-effective microarchitecture called the integer-
decoupled microarchitecture that improves the performance of integer programs and can
be integrated into a conventional microarchitecture with little or no increase in complexity.
The integer-decoupled microarchitecture starts with a conventional microarchitecture and
augments the floating-point units to perform simple integer operations. Some integer
instructions, those not used for computing addresses and accessing memory, are then offloaded
to the augmented floating-point subsystem by the compiler. Consequently, for integer
programs, the integer-decoupled microarchitecture provides a larger window for
dynamic scheduling as well as extra issue and execution bandwidth at no increase in com-
plexity.
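
The partitioning criterion, instructions not used for computing addresses or accessing memory, can be sketched as a backward pass over straight-line code. The Python below is a simplified illustration under our own assumptions (a dictionary-based instruction format, no control flow, no cost for moving values between subsystems); it is not the compiler algorithm described in Chapter 4.

```python
def partition(instrs):
    """Label each instruction 'INT' (stays in the integer subsystem) or 'FP' (offloadable).

    Each instruction is a dict with 'op' ('alu', 'load', or 'store'), 'dest',
    'srcs', and, for memory operations, 'addr' (registers forming the address).
    This format is a hypothetical one chosen for the sketch.
    """
    needed = set()      # registers that feed a later address computation
    labels = []
    for ins in reversed(instrs):
        if ins["op"] != "alu":
            # Memory operations stay on the integer side.
            labels.append("INT")
            needed.discard(ins.get("dest"))
            needed.update(ins.get("addr", ()))
        elif ins["dest"] in needed:
            # This result feeds an address, so it stays; its sources are now needed.
            labels.append("INT")
            needed.discard(ins["dest"])
            needed.update(ins["srcs"])
        else:
            labels.append("FP")
    labels.reverse()
    return labels


example = [
    {"op": "alu", "dest": "r1", "srcs": ("r0",)},                    # address arithmetic
    {"op": "load", "dest": "r2", "srcs": (), "addr": ("r1",)},
    {"op": "alu", "dest": "r3", "srcs": ("r2", "r2")},               # pure computation
    {"op": "alu", "dest": "r4", "srcs": ("r3", "r5")},               # pure computation
    {"op": "store", "dest": None, "srcs": ("r4",), "addr": ("r1",)},
]
```

In the example, only the two instructions that do not contribute to an address are marked as candidates for the augmented floating-point subsystem.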
We evaluate the potential performance improvements with the integer-decoupled
microarchitecture. Our results show that a modest to significant fraction of the total
dynamic instructions in our benchmark programs can be off-loaded to the augmented
floating-point subsystem. In doing so, the integer-decoupled microarchitecture provides
speedups from 3% to 23% over a 4-wide (2 integer and 2 floating-point units) conventional
microarchitecture. Furthermore, the results show that only simple integer operations
need to be supported in the floating-point subsystem. This minimizes the additional hardware
cost.
1.5 Thesis Organization
The remainder of this thesis is organized as follows. Chapter 2 describes the simple
models that we developed, along with the methodology used, for quantifying the complex-
ity of superscalar microarchitectures. Chapter 3 proposes and evaluates dependence-based
superscalar microarchitectures. Chapter 4 introduces and investigates the integer-decou-
pled microarchitecture. Finally, Chapter 5 gives conclusions and suggests future directions
to explore. The appendices include detailed experimental results for Chapter 2.
Chapter 2
Quantifying the Complexity of Superscalar Microarchitectures
The complexity of a microarchitecture is difficult to determine: to be very accurate, it
would require a full implementation in a specific technology. What is very much needed
are fairly straightforward measures, possibly only relative measures, of complexity that
can be used by microarchitects at a fairly early stage of the design process. This chapter
presents work that takes a step in that direction. Simple models that quantify the complex-
ity of superscalar microarchitectures are developed and used to identify long-term com-
plexity trends.
We start by identifying those portions of a microarchitecture whose complexity grows
with increasing instruction-level parallelism. Of these, we focus on register rename logic,
window logic, register file logic, and data bypass logic. We analyze potential critical paths
in these structures and develop models for quantifying their delays. We study the manner
in which these delays vary with microarchitectural parameters like window size (the num-
ber of instructions from which ready instructions are selected for issue) and issue width
(the number of instructions that can be issued in a cycle). We also study the impact of the
technology trend towards smaller feature sizes. In particular, we analyze how the poor
scaling of wire delays in future affects the overall delay of critical structures.
In addition to delays, we study the performance effects of pipelining critical structures.
Even if the delay of a structure is relatively large, it may not increase the complexity of the
design because the structure's operation can be spread over multiple pipestages. We analyze
structures to identify those whose operation must be accomplished within a single
cycle to avoid significant degradation in the number of instructions committed every cycle.
The rest of this chapter is organized as follows. Section 2.1 describes the sources of
complexity in a baseline microarchitecture. Section 2.2 describes the methodology we use
to study the critical structures identified in Section 2.1. Section 2.3 briefly discusses technology
trends. Section 2.4 presents a detailed analysis of each structure and how the delay
of the structure varies with microarchitectural parameters and technology parameters.
Section 2.5 discusses pipelining issues for each of the structures and presents overall delay
results. Finally, Section 2.6 lists related work, and Section 2.7 summarizes the chapter.
2.1 Sources of Complexity
Before delving into specific sources of complexity, we describe the baseline superscalar
model assumed for the study. We then list the basic structures that are the primary sources
of complexity. Finally, we show how these basic structures are present in one form or
another in most current implementations even though these implementations might appear
to be different superficially. On the other hand, we realize that it is impossible to capture
all possible microarchitectures in a single model and any results provided here have some
obvious limitations. We can only provide a fairly straightforward model that is typical of
most current superscalar processors, and suggest that techniques similar to those used here
can be extended for other, more advanced models as they are developed.
Figure 2-1 illustrates the baseline model and the associated pipeline. The fetch unit
fetches multiple instructions every cycle from the instruction cache. Branches encountered
by the fetch unit are predicted. Following instruction fetch, instructions are decoded and
their register operands are renamed. Register renaming involves mapping the logical regis-
ter operands of an instruction to the appropriate physical registers. Renamed instructions
are then deposited in the issue window, where they wait for their source operands and the
appropriate functional unit to become available. As soon as these conditions are satisfied,
the instruction is issued and executes on one of the functional units. The operand values of
an instruction are either fetched from the register file or are bypassed from earlier instructions
in the pipeline. The data cache provides low latency access to memory operands via
loads and stores.
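
The renaming step described above, mapping logical register operands to physical registers, can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions: instructions are renamed one at a time (a real rename stage handles a group per cycle and must check dependences within the group), and physical registers are never returned to the free list. The class and method names are hypothetical.

```python
class RenameTable:
    """Minimal logical-to-physical register map with a free list (illustrative only)."""

    def __init__(self, num_logical, num_physical):
        # Initially logical register i maps to physical register i.
        self.map = list(range(num_logical))
        self.free = list(range(num_logical, num_physical))

    def rename(self, dest, srcs):
        """Rename one instruction; returns (physical_dest, physical_srcs)."""
        # Sources are looked up before the destination mapping is updated,
        # so an instruction that reads and writes the same logical register
        # sees the older value.
        phys_srcs = [self.map[s] for s in srcs]
        phys_dest = None
        if dest is not None:
            if not self.free:
                raise RuntimeError("no free physical register; rename must stall")
            phys_dest = self.free.pop(0)
            self.map[dest] = phys_dest
        return phys_dest, phys_srcs
```

For example, with 4 logical and 8 physical registers, renaming "r1 = r2 op r3" followed by "r2 = r1 op r1" makes the second instruction read the physical register just allocated for r1.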
The issue window is responsible for monitoring dependences between instructions in the
window and issuing instructions to the functional units. The window logic consists of two
components: the wakeup logic and the select logic. The first component is responsible
for waking up instructions waiting in the issue window for their source operands to
become available. Once an instruction is issued for execution, the tag corresponding to its
result is broadcast to all the instructions in the window. Each instruction in the window
compares the tag with its source operand tags. Once all the source operands of an instruction
are available, the instruction is flagged ready for execution. The select logic is responsible
for selecting instructions for execution from the pool of ready instructions. An
instruction is said to be ready if all of its source operands are available. As pointed out earlier,
the wakeup logic is responsible for setting the ready flag.
Figure 2-1. Baseline superscalar model. (Stage labels: FETCH, RENAME, WAKEUP/SELECT, REG READ, EXECUTE/BYPASS, DCACHE ACCESS, COMMIT. Blocks: fetch, rename, window with wakeup and select logic, register file, functional units, bypass, data cache.)
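
The wakeup and select steps described above can be modeled behaviorally as follows. This Python sketch is an assumption-laden illustration: it ignores functional unit availability, uses an oldest-first selection policy that the text does not specify, and says nothing about circuit delay. The names `Entry`, `wakeup`, and `select` are hypothetical.

```python
class Entry:
    """One issue-window entry: an instruction waiting on a set of source tags."""

    def __init__(self, name, src_tags):
        self.name = name
        self.waiting = set(src_tags)    # source tags not yet broadcast

    @property
    def ready(self):
        # An instruction is ready once all of its source operands are available.
        return not self.waiting


def wakeup(window, result_tags):
    """Broadcast result tags; every entry compares them with its source tags."""
    for entry in window:
        entry.waiting.difference_update(result_tags)


def select(window, issue_width):
    """Remove and return up to issue_width ready entries, oldest first.

    The window list is assumed to be kept in program order.
    """
    chosen = [entry for entry in window if entry.ready][:issue_width]
    for entry in chosen:
        window.remove(entry)
    return chosen
```

Each cycle of this model would call `select` to issue ready instructions and then `wakeup` with the tags of the results they produce.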
2.1.1 Basic Structures
The most important criterion used for identifying a basic structure for our study is that
the delay of the structure should be a function of either issue window size or issue width or
both. For example, we consider register renaming to be a basic structure because its delay
depends on the number of ports into the mapping table which in turn is determined by the
issue width. On the other hand none of the functional units are included in the study
because their delay is independent of both the issue width and the window size. In addi-
tion, our decision to study a particular structure was based on two observations. First, we
are primarily interested in dispatch and issue-related structures because these structures
form the core of a microarchitecture and largely determine the amount of parallelism that
can be exploited. Second, some of these structures rely on broadcast operations on long
wires and hence, their delays might not scale as well as logic-intensive structures in future
technologies with smaller feature sizes. Hence, we believe that these structures are poten-
tial cycle-time determinants in future wide-issue designs in advanced technologies.
The structures we consider are:
Register rename logic
Window wakeup logic
Window selection logic
Register file logic
Data bypass logic
There are other important pieces of logic that are not considered in this thesis, even
though their delay is a function of issue width. These are:
Caches.
Instruction and data caches provide low latency access to instructions and memory oper-
ands, respectively. In order to provide the necessary load/store bandwidth [SF91] in a
superscalar processor, the cache has to be banked or duplicated. The access time of a
cache is a function of the size of the cache and the associativity of the cache. Wada et al.
[WRP92] and Wilton and Jouppi [WJ94] have developed detailed models that estimate the
access time of a cache given its size and associativity.
Instruction fetch logic
Besides the instruction cache, there are other important parts of fetch logic whose com-
plexity varies with dispatch width. First of all, as instruction issue widths grow beyond the
size of a single basic block, it will become necessary to predict multiple branches every
cycle. Then, non-contiguous blocks of instructions will have to be fetched from the
instruction cache and compacted into a contiguous block prior to renaming. Rotenberg et
al. [RBS96] describe the logic required for these operations. However, delay models
remain to be developed. And, although they are important, they are not considered here.
Finally, it must be pointed out once again that in real designs there may be structures not
listed above that influence the overall delay of the critical path. However, our realistic aim
is not to study all of them but to analyze in detail some important ones that have been
reported in the literature. We believe that our basic technique can be applied to others,
however.
2.1.2 Current Implementations
The structures identified above were presented in the context of the baseline superscalar
model shown in Figure 2-1. The MIPS R10000 [Yea96] and the DEC 21264 [Gwe96a]
are two implementations of this model. Hence, the structures identified above apply to
these two processors.
On the other hand, the Intel Pentium Pro [Gwe95b], the PowerPC 604 [SDC95], and the
HAL SPARC64 [Gwe95a] are based on the reservation station model shown in Figure 2-2. There
are two main differences between the two models. First, in the baseline model all the register
values, both speculative and non-speculative, reside in the physical register file. In the
reservation station model, the reorder buffer holds speculative values and the register file
holds only committed, non-speculative data. Second, operand values are not broadcast to
the window entries in the baseline model; only their tags are broadcast, and data values go to
the physical register file. In the reservation station model, completing instructions broadcast
result values to the reservation stations. Issuing instructions read their operand values
from the reservation station.
The point to be noted is that the basic structures identified earlier are also present in the
reservation station model and are as critical as in the baseline model. The only notable difference
is that the reservation station model has a smaller physical register file (equal to
the number of architected registers) and might not demand as much bandwidth (as many
ports) as the register file in the baseline model, because in this case some of the operands
come from the reorder buffer and the reservation stations.
While the discussion of potential sources of complexity is in the context of a baseline
superscalar model that is out-of-order, it must be pointed out that some of the critical
structures identified apply to in-order processors too. For example, the register file logic
and the data bypass logic are also present in in-order superscalar processors.
Figure 2-2. Reservation stations-based superscalar model. (Stage labels: FETCH, RENAME, WAKEUP+SELECT, ROB READ, REG READ, EXECUTE/BYPASS, DCACHE ACCESS, COMMIT. Blocks: fetch, rename, window, reorder buffer, register file, functional units, bypass, data cache.)
2.2 Methodology
Each structure was studied in two phases. In the first phase, a representative CMOS circuit
was selected for the structure. This was done by studying designs published in the literature^1
and by collaborating with engineers at Digital Equipment Corporation. In cases
where there was more than one possible design, we performed a preliminary study of the
designs to select one that was most promising. In one case, register renaming, we had to
study (simulate) two different schemes.
In the second phase, the circuit was implemented and optimized for speed. Circuits were
designed mostly using static logic. We believe that power and robustness considerations
will make static logic more attractive than dynamic logic in future. However, in situations
where dynamic logic helped boost the performance significantly, dynamic logic was used.
For example, in the window wakeup logic, a dynamic 7-input NOR gate was used for
comparisons instead of a static gate. A number of optimizations were applied to improve
the speed of the circuits. First, all the transistors in the circuit were manually sized so that
overall delay improved. Second, logic optimizations like two-level decomposition were
applied to reduce fan-in requirements. Static gates with a fan-in greater than four were
avoided. Third, in some cases transistor reordering was used to shorten the critical path.
Some of the optimization sites will be pointed out when the individual circuits are
described.
We used the HSPICE circuit simulator [Met87] from MetaSoftware to simulate the circuits.
In order to simulate the effect of wire parasitics, parasitics were added at appropriate
nodes in the Hspice model of the circuit. These parasitics were computed by calculating
the length of the wires based on the layout of the circuit and using the values of $R_{metal}$
and $C_{metal}$, the resistance and parasitic capacitance of metal wires per unit length.
1. Mainly proceedings of the International Solid-State Circuits Conference (ISSCC).
To study the effect of reducing the feature size on the delays of the structures, we
simulated the circuits for three different feature sizes: 0.8µm, 0.35µm, and 0.18µm.
The process parameters for the 0.8µm CMOS process were taken from Johnson and
Jouppi's synthetic model [JJ90]. These parameters were used by Wilton and Jouppi
[WJ94] to study the access time of caches. Because process parameters are proprietary
information, we had to use extrapolation to come up with process parameters for the
0.35µm and 0.18µm technologies. We used the 0.8µm process parameters from Johnson
and Jouppi's synthetic model [JJ90], 0.5µm process parameters from MOSIS, and process
parameters used in the literature as inputs. The process parameters assumed for the three
technologies are listed in Appendix A. Layouts for the 0.35µm and 0.18µm technologies
were obtained by appropriately shrinking the layout for the 0.8µm technology.
Finally, basic RC circuit analysis was used to develop simple analytical models that cap-
tured the dependence of the delays on microarchitectural parameters like issue width and
window size. The relationships predicted by the Hspice simulations were compared
against those predicted by our model. In most of the cases, our models were accurate in
identifying the relationships.
2.2.1 Caveats
The above methodology does not address the issue of how well the assumed circuits
reflect real circuits for the structures. However, by basing our circuits on designs published
by microprocessor vendors, we believe that the assumed circuits are close to real circuits.
In practice, many circuit tricks can be employed to optimize critical paths for speed.
However, we believe that the relative delay times between different configurations should be
more accurate than the absolute delay times. Because we are mainly interested in finding
trends in the manner in which delays of the structures vary with microarchitectural
parameters like window size and issue width, and how the delays scale as the feature size is
reduced, we believe that our results are valid.
It must also be pointed out that while the absolute delay times presented in this thesis
track the resulting clock speed, they cannot be directly converted into clock speeds. There
are two reasons for this. First, we do not include the delay of inter-stage latches and the
delay resulting from clock skew in our measurements. These two components can be
responsible for a non-trivial fraction of the total delay [NH97], especially for high fre-
quency designs. Second, the delay of a design can show considerable variance with pro-
cess parameters and temperature of operation. Commercial designs are required to operate
over a range of process parameters and physical temperatures. Our designs were simulated
for a single set of process parameters and a single temperature point (25°C).
2.2.2 Terminology
Table 2.1 defines some of the common terms used in the rest of this chapter. The remaining
terms will be defined when they are introduced.

| Symbol | Represents |
|---|---|
| IW | Issue width |
| WINSIZE | Window size |
| NVREG | Number of logical registers |
| NPREG | Number of physical registers |
| $NVREG_{width}$ | Width of logical register tags |
| $NPREG_{width}$ | Width of physical register tags |
| $DATA_{width}$ | Width of datapath |
| $R_{metal}$ | Resistance of metal wire per unit length |
| $C_{metal}$ | Capacitance of metal wire per unit length |

Table 2.1: Terminology.

2.3 Technology Trends
Feature sizes of MOS devices have been steadily decreasing. This trend [Ass97] towards
smaller devices is likely to continue at least for the next decade. In this section, we briefly
discuss the effect of shrinking feature sizes on circuit delays. The effect of scaling feature
sizes on circuit performance is an active area of research [D+74, MF95]. We are only
interested in illustrating the trends in this section.
Circuit delays consist of logic delays and wire delays. Logic delays result from gates
that drive other gates. Wire delays are the delays resulting from driving values on wires.
2.3.1 Logic Delays
The delay of a logic gate can be written as

$$Delay_{gate} = \frac{C_L \times V}{I}$$

where $C_L$ is the load capacitance at the output of the gate, V is the supply voltage, and I is
the average charging/discharging current. I is a function of $I_{dsat}$, the saturation drain
current of the devices forming the gate. As the feature size is reduced, the supply voltage
has to be scaled down to keep the power consumption at manageable levels. Because voltages
cannot be scaled arbitrarily, they follow a different scaling curve from feature sizes.
For submicron devices [Rab96], if S is the scaling factor for feature sizes, and U is the
scaling factor for supply voltages, then $C_L$, V, and I scale by factors of 1/S, 1/U, and 1/U
respectively. Hence, the overall gate delay scales by a factor of 1/S. Therefore, gate
delays decrease uniformly as the feature size is reduced.

2.3.2 Wire Delays
If L is the length of a wire, then the intrinsic RC delay of the wire is given by

$$Delay_{wire} = 0.5 \times R_{metal} \times C_{metal} \times L^2$$

where $R_{metal}$ and $C_{metal}$ are the resistance and parasitic capacitance of metal wires per unit
length respectively and L is the length of the wire. The factor 0.5 is introduced because we
use the first order approximation that the delay at the end of a distributed RC line is
RC/2 (we assume the resistance and capacitance are distributed uniformly over the
length of the wire).
In order to study the impact of shrinking feature sizes on wire delays we first have to
analyze how the resistance, $R_{metal}$, and the parasitic capacitance, $C_{metal}$, of metal wires
vary with feature sizes. We use the simple model presented by Bohr [Boh95] to estimate
how $R_{metal}$ and $C_{metal}$ scale with feature size. Note that both these quantities are per unit
length measures. Using Bohr's model [Boh95],

$$R_{metal} = \frac{\rho}{width \times thickness}$$

$$C_{metal} = C_{fringe} + C_{parallelplate} = \frac{2 \times \varepsilon \times \varepsilon_0 \times thickness}{width} + \frac{2 \times \varepsilon \times \varepsilon_0 \times width}{thickness}$$

where width is the width of the wire, thickness is the thickness of the wire, $\rho$ is the
resistivity of metal, and $\varepsilon$ and $\varepsilon_0$ are permittivity constants.

The average metal thickness has remained relatively constant for the last few generations
while the width has been decreasing in proportion to the feature size. Hence, if S is
the scaling factor for feature sizes, the scaling factor for $R_{metal}$ is S. The metal capacitance
has two components: fringe capacitance and parallel-plate capacitance. Fringe capacitance
is the result of capacitance between the side-walls of adjacent wires and capacitance
between the side-walls of the wires and the substrate. Parallel-plate capacitance is the
result of capacitance between the bottom-wall of the wires and the substrate. Assuming
that the thickness remains constant, it can be seen from the equation for $C_{metal}$ that the
fringe capacitance becomes dominant as we move towards smaller feature sizes. Rahmat
et al. [RNOM95] show that as feature sizes are reduced, the fringe capacitance will be
responsible for an increasingly larger fraction of the total capacitance. For example, they
show that for feature sizes less than 0.1µm, the fringe capacitance contributes 90% of the
total capacitance. In order to accentuate the effect of wire delays and to be able to identify
their effects, we assume that the metal capacitance is largely determined by the fringe
capacitance and therefore the scaling factor for $C_{metal}$ is also S.
Using the above scaling factors in the equation for the wire delay, we can compute the
scaling factor for wire delays as

$$\text{Scaling Factor} = S \times S \times (1/S)^2 = 1$$

Note that the length scales as 1/S for local interconnects. In this study we are only
interested in local interconnects. This length scaling might not hold for global interconnects
like the clock because their length also depends on the die size.

Hence, as feature sizes are reduced, wire delays remain constant. This, coupled with the
fact that logic delays decrease uniformly with feature size, implies that wire delays will
dominate total delays in the future. In reality, the situation is further aggravated for two
reasons. First, not all wires reduce in length perfectly (by a factor of S). Second, some of the
global wires, like the clock, actually increase in length due to bigger dice that are made
possible with each generation.

McFarland and Flynn [MF95] studied various scaling schemes for local interconnect
and conclude that a quasi-ideal scaling scheme closely tracks future deep submicron
technologies. Quasi-ideal scaling performs ideal scaling of the horizontal dimensions but
scales the thickness more slowly. The scaling factor for RC delay per unit length for their
scaling model is $0.9 \times S^{1.5} + 0.1 \times S^{2.5}$. In comparison, for our scaling model, the
scaling factor for RC delay per unit length is a more conservative, and simpler, $S^2$.
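The scaling argument of Section 2.3 can be checked numerically. The following is a minimal sketch, not part of the original analysis: the function names are hypothetical, and the only inputs are the scaling factors stated in the text (1/S, 1/U, 1/U for the gate terms; S, S, and 1/S for the wire terms).

```python
def gate_delay_scale(S, U):
    """Scaling factor of C_L * V / I when C_L, V, I scale by 1/S, 1/U, 1/U."""
    c_l, v, i = 1.0 / S, 1.0 / U, 1.0 / U
    return c_l * v / i


def local_wire_delay_scale(S):
    """Scaling factor of 0.5 * R_metal * C_metal * L^2 for a local wire.

    R_metal and C_metal (per unit length) each scale by S; L scales by 1/S.
    """
    return S * S * (1.0 / S) ** 2


def rc_per_unit_length_scale(S, quasi_ideal=False):
    """Scaling factor of RC delay per unit length.

    quasi_ideal=True uses the McFarland and Flynn expression quoted in the
    text; otherwise the simpler S^2 model of this chapter is used.
    """
    if quasi_ideal:
        return 0.9 * S ** 1.5 + 0.1 * S ** 2.5
    return S ** 2


if __name__ == "__main__":
    # Shrink from 0.8um to 0.35um and to 0.18um; U = 1.5 is an arbitrary
    # illustrative voltage scaling factor (it cancels out of the gate delay).
    for S in (0.8 / 0.35, 0.8 / 0.18):
        print(S, gate_delay_scale(S, 1.5), local_wire_delay_scale(S),
              rc_per_unit_length_scale(S), rc_per_unit_length_scale(S, True))
```

Under these assumptions the gate delay shrinks as 1/S while the local wire delay factor stays at 1, which is the divergence the text describes.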
2.4 Complexity Analysis
In this section we discuss the critical structures in detail. The presentation of each
structure is organized as follows. First, we describe the logical function implemented by the
structure. Then, we present possible schemes for implementing the structure and describe
one of the schemes in detail. Next, we analyze the overall delay of the structure in terms of
microarchitectural parameters like issue width and window size using simple delay models.
Finally, we present Spice simulation results, identify trends in the results and discuss
how the results conform to the delay analysis performed earlier.
2.4.1 Register Rename Logic
The register rename logic is used to translate logical register designators into physical
register designators. Logically, this is accomplished by accessing a map table with the log-
ical register designator as the index. Because multiple instructions, each with multiple reg-
ister operands, need to be renamed every cycle, the map table has to be multi-ported. For
example, a 4-wide issue machine with two read operands and one write operand per
instruction requires 8 read ports and 4 write ports into the mapping table. The high level
block diagram of the rename logic is shown in Figure 2-3. The map table holds the current
logical to physical mappings. In addition to the map table, dependence check logic is
required to detect cases where the logical register being renamed is written by an earlier
instruction in the current group of instructions being renamed. The dependence check
logic detects such dependences and sets up the output MUXes so that the appropriate
physical register designators are generated. The shadow table is used to checkpoint old
mappings so that the processor can quickly recover to a precise state from branch
mispredictions. At the end of every rename operation, the map table is updated to reflect the new
logical to physical mappings created for the result registers of the current rename group.
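The port-count arithmetic in the example above generalizes directly. This is a small sketch with a hypothetical helper name; the defaults of two read operands and one write operand per instruction are the ones assumed in the text.

```python
def map_table_ports(issue_width, reads_per_inst=2, writes_per_inst=1):
    """Return (read_ports, write_ports) needed by the rename map table."""
    return issue_width * reads_per_inst, issue_width * writes_per_inst


if __name__ == "__main__":
    for iw in (2, 4, 8):
        print(iw, map_table_ports(iw))
```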
2.4.1.1 Structure
The mapping and checkpointing functions of the rename logic can be implemented in at
least two ways. These two schemes, called the RAM scheme and the CAM scheme, are
described next.
RAM scheme
In the RAM scheme, as implemented in the MIPS R10000 [Yea96], the map table is a
RAM where each entry contains the physical register that is mapped to the logical register
whose designator is used to index the table. The number of entries in the map table is
equal to the number of logical registers. A single cell of the table is shown in Figure 2-5. A
shift register, present in every cell, is used for checkpointing old mappings.
The map table works like a register file. The bits of the physical register designators are
stored in the cross-coupled inverters in each cell. A read operation starts with the logical
register designator being applied to the decoder. The decoder decodes the logical register
designator and raises one of the word lines. This triggers bit line changes which are sensed
by a sense amplifier and the appropriate output is generated. Precharged, double-ended bit
lines are used to improve the speed of read operations. Mappings are checkpointed by
copying the current contents of each cell into the shift register. Recovery is performed by
writing the bit in the appropriate shift register cell back into the main cell.

Figure 2-3. Register rename logic. [Diagram omitted. Labels: LOGICAL SOURCE REGS, LOGICAL DEST. REGS, LOGICAL SOURCE REG R, MAP TABLE, DEPENDENCE CHECK LOGIC (SLICE), MUX, PHYSICAL SOURCE REGS, PHYSICAL DEST. REGS, PHYSICAL REG FOR REG R.]
CAM scheme
An alternative scheme for register renaming uses a CAM (content-addressable memory)
to store the current mappings. Such a scheme is implemented in the HAL SPARC
[AMG+95] and the DEC 21264 [Kel96]. The number of entries in the CAM is equal to the
number of physical registers. Each entry contains two fields. The first field stores the logical
register designator that is mapped to the physical register represented by the entry. The
second field contains a valid bit that is set if the current mapping is valid. The valid bit is
required because a single logical register designator might map to more than one physical
register. When a mapping is changed, the logical register designator is written into the
entry corresponding to a free physical register and the valid bit of the entry is set. At the
same time, the valid bit used for the previous mapping is located through an associative
search and cleared.
The rename operation in this scheme proceeds as follows. The CAM is associatively
searched with the logical register designator. If there is a match and the valid bit is set, a
read enable wordline corresponding to the CAM entry is activated. An encoder (ROM) is
used to encode the read enable word lines (one per physical register) into a physical regis-
ter designator. Old mappings are checkpointed by storing the valid bits from the CAM into
a checkpoint RAM. To recover from an exception, the valid bits corresponding to the old
mapping are loaded into the CAM from the checkpoint RAM. In the HAL design, up to 16
old mappings can be saved.
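The CAM scheme described above can be sketched behaviorally as follows. This is an illustrative model only, not the HAL or DEC implementation: the class and method names are hypothetical, free-list management is not modeled, and the sketch assumes that an entry whose valid bit was saved in a checkpoint is not overwritten before recovery.

```python
class CamMapTable:
    """Behavioral sketch of a CAM-based rename map (one entry per physical register)."""

    def __init__(self, num_physical):
        self.logical = [None] * num_physical  # first field: logical register designator
        self.valid = [False] * num_physical   # second field: valid bit
        self.checkpoints = []                 # stands in for the checkpoint RAM

    def lookup(self, logical_reg):
        """Associative search; returns the matching physical register or None."""
        for preg, (lreg, is_valid) in enumerate(zip(self.logical, self.valid)):
            if is_valid and lreg == logical_reg:
                return preg
        return None

    def remap(self, logical_reg, free_preg):
        """Map logical_reg to free_preg and clear the valid bit of the old mapping."""
        old = self.lookup(logical_reg)
        if old is not None:
            self.valid[old] = False
        self.logical[free_preg] = logical_reg
        self.valid[free_preg] = True

    def checkpoint(self):
        """Save only the column of valid bits; returns a checkpoint id."""
        self.checkpoints.append(list(self.valid))
        return len(self.checkpoints) - 1

    def recover(self, checkpoint_id):
        """Reload the valid bits saved at checkpoint_id and drop younger checkpoints."""
        self.valid = list(self.checkpoints[checkpoint_id])
        del self.checkpoints[checkpoint_id:]


if __name__ == "__main__":
    cam = CamMapTable(num_physical=8)
    cam.remap(4, 2)          # r4 -> p2
    cp = cam.checkpoint()
    cam.remap(4, 7)          # speculative remap: r4 -> p7
    cam.recover(cp)          # misprediction: restore the old valid bits
    print(cam.lookup(4))
```

Note that only the valid bits are saved and restored, which is the checkpointing advantage the text attributes to the CAM scheme.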
The CAM scheme is less scalable than the RAM scheme because the number of CAM
entries, which is equal to the number of physical registers, increases with issue width. In
order to support such a large number of physical registers, the CAM will have to be appro-
priately banked. On the other hand, in the RAM scheme, the number of entries in the map
table is independent of the number of physical registers. However, the CAM scheme has
an advantage with respect to checkpointing. In order to checkpoint in the CAM scheme,
only the valid bits have to be saved. This is easily implemented by having a RAM adjacent
to the column of valid bits in the CAM. In other words, the dimensions of the individual
CAM cells are independent of the number of checkpoints. On the other hand, in the RAM
scheme, the width of individual cells is a function of the number of checkpoints because
this number determines the length of the shift register in each cell.
The dependence check logic proceeds in parallel with the map table access. Every logi-
cal register designator being renamed is compared against the destination register designa-
tors (logical) of earlier instructions in the current rename group. If there is a match, then
the tag corresponding to the physical register assigned to the earlier instruction is used
instead of the tag read from the map table. For example, in the case shown in Figure 2-4,
the last instruction's operand register r4 is mapped to p7 and not p2. In the case of more
than one match, the tag corresponding to the latest (in dynamic order) match is used. We
implemented the dependence check logic for issue widths of 2, 4, and 8. We found that for
these issue widths, the delay of the dependence check logic is less than the delay of the
map table, and hence the check can be hidden behind the map table access.
| | Before renaming | After renaming |
|---|---|---|
| Instructions | add r1,r2,r3 | add p1,p3,p9 |
| | add r4,r2,r5 | add p7,p3,p6 |
| | add r2,r3,r4 | add p4,p9,p7 |
| Map table (r1 to r5) | 8, 3, 9, 2, 6 | 1, 4, 9, 7, 6 |
| Free regs | 1, 7, 4, 11 | 11 |

Figure 2-4. Renaming example showing dependency checking. The first entry of the map
table corresponds to logical register r1.
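The renaming example of Figure 2-4 can be reproduced with a short behavioral sketch. The function name and data layout are hypothetical, and the free-list contents (1, 7, 4, 11) are read off the figure and should be treated as an assumption. The hardware performs the map table read and the dependence check in parallel and selects between them with the output MUXes; the sketch processes the group sequentially but gives the same result.

```python
def rename_group(map_table, free_regs, group):
    """Rename one group of (dest, src1, src2) logical-register triples.

    map_table: dict from logical register to physical register.
    free_regs: list of free physical registers, allocated from the front.
    Returns the renamed (dest, src1, src2) triples; both arguments are
    updated in place at the end of the rename operation.
    """
    renamed = []
    pending = {}  # destinations written by earlier instructions in this group
    for dest, src1, src2 in group:
        # Dependence check: the latest earlier writer in the group overrides
        # the tag read from the map table.
        p1 = pending.get(src1, map_table[src1])
        p2 = pending.get(src2, map_table[src2])
        pd = free_regs.pop(0)
        pending[dest] = pd
        renamed.append((pd, p1, p2))
    map_table.update(pending)
    return renamed


if __name__ == "__main__":
    table = {1: 8, 2: 3, 3: 9, 4: 2, 5: 6}
    free = [1, 7, 4, 11]
    # add r1,r2,r3 ; add r4,r2,r5 ; add r2,r3,r4
    print(rename_group(table, free, [(1, 2, 3), (4, 2, 5), (2, 3, 4)]))
    print(table, free)
```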
2.4.1.2 Delay Analysis
We implemented both the RAM scheme and the CAM scheme. We found the perfor-
mance of the two schemes to be comparable for the design space we explored. To keep the
analysis short and since the RAM scheme is more scalable, we will only discuss the RAM
scheme here.
A single cell of the map table is shown in Figure 2-5. The critical path for the rename
logic is the time it takes for the bits of the physical register designator to be output after
the logical register designator is applied to the address decoder. The delay of the critical
path consists of three components: the time taken to decode the logical register designator,
the time taken to drive the wordline, and the time taken by an access stack to pull the bitline
low plus the time taken by the sense amplifier to detect this bitline change and produce the
corresponding output. The time taken for the output of the map table to pass through the
MUX in Figure 2-3 is ignored because this is very small compared to the rest of the
rename logic and, more importantly, the control input of the MUX is available in advance
because the dependence check logic is faster than the map table. Hence, the overall delay
is given by

$$Delay = T_{decode} + T_{wordline} + T_{bitline}$$

Each of the components is analyzed next.

Figure 2-5. Rename map table. This figure shows the map table of the rename logic on the
left and a single cell of the map table on the right. [Diagram omitted. Labels: LOGICAL REGISTER ID, DECODER, WORDLINE, BITLINE, SENSE AMPLIFIER, PHYSICAL TAG BIT, ROW0, ROW1, ROW NVREG - 1, SHIFT REG, CELL, READ PORT, WRITE PORT, WORDLINES, BITLINES, ACCESS STACK.]
Decoder delay
The structure of the decoder is shown in Figure 2-6. We use predecoding to improve the
speed of decoding. The predecode gates are 3-input NAND gates and the row decode gates
are 3-input NOR gates. The output of the NAND gates is connected to the input of the
NOR gates by the predecode lines. The length of these lines is given by

$$\mathit{PredeclineLength} = (\mathit{cellheight} + 3 \times IW \times \mathit{wordline}_{spacing}) \times NVREG$$

where cellheight is the height of a single cell excluding the wordlines, IW is the issue
width, $\mathit{wordline}_{spacing}$ is the spacing between wordlines, and NVREG is the number of
logical registers. The factor 3 in the equation results from the assumption of 3-operand
instructions (2 read operands and 1 write operand). With these assumptions, 3 ports (2 read
ports and 1 write port) are required per cell for each instruction being renamed. Hence, for
an IW-wide issue machine, a total of 3 × IW wordlines are required for each cell.

The decoder delay is the time it takes to decode the logical register designator, i.e., the
time it takes for the output of the NOR gate to rise after the input to the NAND gate has
been applied. Hence, the decoder delay can be written as

$$T_{decode} = T_{nand} + T_{nor}$$

where $T_{nand}$ is the delay of the NAND gate and $T_{nor}$ is the delay of the NOR gate. From
the equivalent circuit of the NAND gate shown in Figure 2-6,

$$T_{nand} = c_0 \times R_{eq} \times C_{eq}$$

$R_{eq}$ consists of two components: the resistance of the NAND pull-down and the metal
resistance of the predecode line connecting the NAND gate to the NOR gate. Hence,

$$R_{eq} = R_{nandpd} + 0.5 \times \mathit{PredeclineLength} \times R_{metal}$$

Note that we have divided the resistance of the predecode line by two; the first order
approximation for the delay at the end of a distributed RC line is RC/2 (we assume that the
resistance and capacitance are distributed evenly over the length of the wire).

Figure 2-6. Decoder structure and equivalent circuit. [Diagram omitted. Labels: LOGICAL REGISTER BITS, PREDECODE NAND GATES, PREDECODE LINES, DIRECT DECODE NOR GATES, WLINV, WORDLINE DRIVER, WORDLINE, ROW 0, ROW NVREG-1; equivalent circuit: $R_{eq}$, $C_{eq}$, INPUT OF NOR GATE.]

$C_{eq}$ consists of three components: the diffusion capacitance of the NAND gate, the gate
capacitance of the NOR gate, and the metal capacitance of the predecode wire. Hence,

$$C_{eq} = C_{diffcapnand} + C_{gatecapnor} + \mathit{PredeclineLength} \times C_{metal}$$

Substituting the above equations into the overall decoder delay and simplifying, we get

$$T_{decode} = c_0 + c_1 \times IW + c_2 \times IW^2$$

where $c_0$, $c_1$, and $c_2$ are constants. The quadratic component results from the intrinsic RC
delay of the predecode lines connecting the NAND gates to the NOR gates. We found that,
at least for the design space and technologies we explored, the quadratic component is
very small relative to the other components. Hence, the delay of the decoder is linearly
dependent on the issue width. Typical values for the constants are listed in Table B.1 in
Appendix B.
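The decoder equations above can be evaluated directly to see where the linear and quadratic terms in IW come from. The sketch below is a minimal illustration: every numeric value in `PARAMS` is an invented placeholder (not a value from Appendix A or Appendix B), the constant `c0` is assumed to be ln 2, and only the NAND term is modeled because the text gives no expression for the NOR delay.

```python
# Illustrative placeholder parameters; these are NOT the thesis's process values.
PARAMS = {
    "c0": 0.69,               # RC-to-delay constant (assumed to be ln 2)
    "r_nandpd": 2000.0,       # NAND pull-down resistance (ohm)
    "c_diffcap_nand": 5.0,    # NAND diffusion capacitance (fF)
    "c_gatecap_nor": 5.0,     # NOR gate capacitance (fF)
    "r_metal": 0.03,          # metal resistance per unit length (ohm/um)
    "c_metal": 0.25,          # metal capacitance per unit length (fF/um)
    "cellheight": 20.0,       # cell height excluding wordlines (um)
    "wordline_spacing": 2.0,  # spacing between wordlines (um)
}


def predecode_line_length(iw, nvreg, p=PARAMS):
    """PredeclineLength = (cellheight + 3 * IW * wordline_spacing) * NVREG, in um."""
    return (p["cellheight"] + 3 * iw * p["wordline_spacing"]) * nvreg


def t_nand_ps(iw, nvreg=32, p=PARAMS):
    """T_nand = c0 * R_eq * C_eq, returned in picoseconds."""
    length = predecode_line_length(iw, nvreg, p)
    r_eq = p["r_nandpd"] + 0.5 * length * p["r_metal"]
    c_eq = p["c_diffcap_nand"] + p["c_gatecap_nor"] + length * p["c_metal"]
    return p["c0"] * r_eq * c_eq / 1000.0  # ohm * fF = fs, so divide by 1000


def quadratic_coefficient_ps(nvreg=32, p=PARAMS):
    """Coefficient of IW^2 in T_nand: the intrinsic RC term of the predecode line."""
    b = 3 * p["wordline_spacing"] * nvreg  # line length added per unit of issue width
    return p["c0"] * 0.5 * p["r_metal"] * p["c_metal"] * b * b / 1000.0


if __name__ == "__main__":
    for iw in (2, 4, 8):
        print(iw, t_nand_ps(iw))
    print("IW^2 coefficient (ps):", quadratic_coefficient_ps())
```

With these placeholder values the IW² coefficient is a small fraction of the per-step increase in delay, consistent in spirit with the observation that the quadratic component is small, although the actual magnitudes depend entirely on the real process parameters.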
Wordline delay
The wordline delay is defined as the time taken to turn on all the access transistors
(denoted by N1 in Figure 2-7) connected to the wordline after the logical register designator
has been decoded. The wordline delay is the sum of the delay of the inverter WLINV
and the delay of the wordline driver. Hence,

$$T_{wordline} = T_{wlinv} + T_{wldriver}$$

From the equivalent circuit of the wordline driver shown in the figure, the wordline
driver delay can be written as

$$T_{wldriver} = c_0 \times (R_{wldriver} + R_{wlres}) \times C_{wlcap}$$

where $R_{wldriver}$ is the effective resistance of the pull-up (p-transistor) of the driver, $R_{wlres}$ is
the resistance of the wordline, and $C_{wlcap}$ is the amount of capacitance on the wordline.
The total capacitance on the wordline consists of two components: the gate capacitance of
the access transistors and the metal capacitance of the wordline wire. The resistance of the
wordline is determined by the length of the wordline. Symbolically,

$$\mathit{WordlineLength} = (\mathit{cellwidth} + 6 \times IW \times \mathit{bitline}_{spacing} + \mathit{sreg}_{width}) \times PREG_{width}$$

$$C_{wlcap} = PREG_{width} \times C_{gatecapN1} + \mathit{WordlineLength} \times C_{metal}$$

$$R_{wlres} = 0.5 \times \mathit{WordlineLength} \times R_{metal}$$

where $PREG_{width}$ is the number of bits in the physical register designator, $C_{gatecapN1}$ is the
gate capacitance of the access transistor N1 in each cell, cellwidth is the width of a single
RAM cell excluding the bitlines, $\mathit{bitline}_{spacing}$ is the spacing between bitlines, and
$\mathit{sreg}_{width}$ is the width of a single bit of the shift register in each cell.

Factoring the above equations into the wordline delay equation and simplifying we get

$$T_{wordline} = c_0 + c_1 \times IW + c_2 \times IW^2$$

where $c_0$, $c_1$, and $c_2$ are constants. Again, the quadratic component results from the
intrinsic RC delay of the wordline wire and we found that the quadratic component is very
small relative to the other components. Hence, the overall wordline delay is linearly
dependent on the issue width. Typical values for the constants are listed in Table B.2 in
Appendix B.

Figure 2-7. Wordline structure and equivalent circuit. [Diagram omitted. Labels: FROM DECODER, WLINV, WORDLINE DRIVER, CELL, N1, $PREG_{width}$ CELLS; equivalent circuit: $R_{wldriver}$, $R_{wlres}$, $C_{wlcap}$.]
Bitline delay
The bitline delay is defined as the time between the wordline going high (turning on the
access transistor N1 shown in Figure 2-8) and the output of the sense amplifier going high/
low. From the figure this is the sum of the time it takes for one access stack to discharge
the bitline and the time it takes for a sense amplifier to detect the discharge. Hence,

$$T_{bitline} = T_{bitdischarge} + T_{senseamp}$$

Figure 2-8. Bitline structure and equivalent circuit. We used Wada's sense amplifier
[WRP92]. [Diagram omitted. Labels: WORDLINE, PRECHARGE, N1, BITLINE, NVREG ROWS, SENSE AMPLIFIER; equivalent circuit: $R_{astack}$, $R_{bitline}$, $C_{bitline}$, sense amplifier input.]

From the equivalent circuit shown in the figure, the time taken to discharge the bitlines is
determined by the following equations.

$$\mathit{BitlineLength} = (\mathit{cellheight} + 3 \times IW \times \mathit{wordline}_{spacing}) \times NVREG$$

$$R_{bl} = 0.5 \times \mathit{BitlineLength} \times R_{metal}$$

$$C_{bl} = NVREG \times C_{diffcap} + \mathit{BitlineLength} \times C_{metal}$$

$$T_{bitdischarge} = c_0 \times (R_{astack} + R_{bl}) \times C_{bl}$$

where $R_{astack}$ is the effective resistance of the access stack (two pass transistors in series),
$R_{bl}$ is the resistance of the bitline, $C_{bl}$ is the capacitance on the bitline, NVREG is the
number of logical registers, $C_{diffcap}$ is the diffusion capacitance of the access stack that
connects to the bitline, cellheight is the height of a single RAM cell excluding the wordlines,
and $\mathit{wordline}_{spacing}$ is the spacing between wordlines.

Factoring the above equations into the overall delay equation and simplifying we get

$$T_{bitline} = c_0 + c_1 \times IW + c_2 \times IW^2$$

where $c_0$, $c_1$, and $c_2$ are constants. Again, we found that the quadratic component is very
small relative to the other components. Hence, the overall bitline delay is linearly
dependent on the issue width.
Overall delay
From the above analysis, the overall delay of the register rename logic can be summarized
by the following equation

$$Delay = c_0 + c_1 \times IW + c_2 \times IW^2$$

where $c_0$, $c_1$, and $c_2$ are constants. However, the quadratic component is relatively small
and hence, the rename delay is a linear function of the issue width for the design space we
explored. Typical values for the constants are listed in Table B.3 in Appendix B.
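To make the role of the three constants concrete, the sketch below recovers $c_0$, $c_1$, and $c_2$ from delays at three issue widths. The function name is hypothetical and the delay values are invented for illustration; they are not the constants of Table B.3 or measured Spice results.

```python
def fit_quadratic(points):
    """Return (c0, c1, c2) such that c0 + c1*x + c2*x^2 passes through three points.

    Uses Newton divided differences, so the three x values must be distinct.
    """
    (x0, y0), (x1, y1), (x2, y2) = points
    d01 = (y1 - y0) / (x1 - x0)
    d12 = (y2 - y1) / (x2 - x1)
    c2 = (d12 - d01) / (x2 - x0)
    c1 = d01 - c2 * (x0 + x1)
    c0 = y0 - c1 * x0 - c2 * x0 * x0
    return c0, c1, c2


if __name__ == "__main__":
    # Hypothetical rename delays (ps) at issue widths 2, 4, and 8.
    samples = [(2, 142.0), (4, 188.0), (8, 292.0)]
    c0, c1, c2 = fit_quadratic(samples)
    print(c0, c1, c2)
    # Share of the delay at IW = 8 contributed by the quadratic term.
    print(c2 * 8 ** 2 / (c0 + c1 * 8 + c2 * 8 ** 2))
```

A small quadratic share at the largest issue width is what justifies treating the rename delay as linear in IW over the explored design space.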
2.4.1.3 Spice Results
Figure 2-9 shows how the delay of the rename logic varies with the issue width, i.e., the
number of instructions being renamed every cycle, for the three technologies. The graph
also shows the breakdown of the delay into the components discussed in the previous
section. Detailed results for various configurations and technologies are shown in tabular
form in Appendix A.

A number of observations can be made from the graph. The total delay increases linearly
with issue width for all three technologies. This is in conformance with the analysis in the
previous section. All the components show a linear increase with issue width. The increase in
the bitline delay is larger than the increase in the wordline delay because the bitlines are
longer than the wordlines in our design. The bitline length is proportional to the number of
logical registers (32 in most cases) whereas the wordline length is proportional to the
width of the physical register designator (less than 8 for the design space we explored).
Another important observation that can be made from the graph is that the relative
increase in wordline delay, bitline delay, and hence, total delay with issue width only
worsens as the feature size is reduced. For example, as the issue width is increased from 2
to 8, the percentage increase in bitline delay shoots up from 37% to 53% as the feature
size is reduced from 0.8µm to 0.18µm. This occurs because logic delays in the various
components are reduced in proportion to the feature size while the presence of wire delays
in the wordline and bitline components causes the wordline and bitline components to fall
at a slower rate. In other words, wire delays in the wordline and bitline structures will
become increasingly important as feature sizes are reduced.

Figure 2-9. Rename delay versus issue width. This graph shows the breakup of rename delay
for issue widths of 2, 4, and 8 for the three technologies. [Bar chart omitted: y-axis rename delay (ps), 0 to 1600; x-axis issue widths 2, 4, and 8 for each of the 0.8µm, 0.35µm, and 0.18µm technologies; components: decoder delay, wordline delay, bitline delay.]
2.4.1.4 Model Results
Figure 2-10 shows how the delays computed by the model using the constants listed in
Appendix B compare to the Spice results presented earlier. The delays computed by the
analytical models, both for rename logic and for other structures to be presented later, are
not always close to the Spice delays. The differences arise due to a number of reasons.
First, the simple RC analysis makes a number of approximations and simplifications that
cause deviation from the Spice result. Second, the simple delay equations used here do not
take into account the slopes of input signals. Third, we could not find reliable delay models
for quantifying the delay of dynamic gates. Since it is beyond the scope of the thesis,
no attempt was made to develop advanced delay models tailored for this study. However,
the analytical models for the different structures help establish dependence relationships
and identify components that will become increasingly important in the future.
Figure 2-10. Model delay results for rename logic. This graph shows how the model delay
results compare to the Spice results for register rename logic. [Bar chart omitted: y-axis rename delay (ps), 0 to 1800; x-axis issue widths 2, 4, and 8 for each of the 0.8µm, 0.35µm, and 0.18µm technologies; series: model delay, Spice delay.]

2.4.2 Window Wakeup Logic
The wakeup logic is responsible for updating source dependences of instructions in the
issue window waiting for their source operands to become available. Figure 2-11 illustrates
the wakeup logic. Every time a result is produced, the tag associated with the result
is broadcast to all the instructions in the issue window. Each instruction then compares the
tag with the tags of its source operands. If there is a match, the operand is marked available
by setting the rdyL or rdyR flag. Once all the operands of an instruction become available
(both rdyL and rdyR are set), the instruction is ready to execute and the rdy flag is set
to indicate this. The issue window is a CAM (content-addressable memory [WE93]) array
holding one instruction per entry. Buffers, shown at the top of the figure, are used to drive
the result tags $tag_1$ to $tag_{IW}$, where IW is the issue width. Each entry of the CAM has
2 × IW comparators to compare each of the result tags against the two operand tags of the
entry. The OR logic combines the comparator outputs and sets the rdyL/rdyR flags.
2.4.2.1 Structure
Figure 2-12 shows a single cell of the CAM array. The cell shown in detail compares a
single bit of the operand tag with the corresponding bit of the result tag. The operand tag
bit is stored in the RAM cell. The corresponding bit of the result tag is driven on the tag
lines. The match line is precharged high. If there is a mismatch between the operand tag
bit and the result tag bit, the match line is pulled low by one of the pull-down stacks. For
example, if tag = 0, and data = 1, then the pull-down stack on the left is turned on and it
pulls the match line low. The pull-down stacks constitute the comparators shown in
Figure 2-12. The matchline extends across all the bits of the tag, i.e., a mismatch in any of
the bit positions will pull it low. In other words, the matchline remains high only if the
result tag matches the operand tag. The above operation is repeated for each of the result
tags by having multiple tag and matchlines as shown in the figure. Finally, all the match
signals are ORed to produce the ready signal.

Figure 2-11. Window wakeup logic. [Diagram omitted. Labels: result tags $TAG_1$ to $TAG_{IW}$ broadcast to WINSIZE INSTS; each entry holds OPD TAGL, OPD TAGR, RDYL, RDYR, with "=" comparators feeding OR gates.]
There are two observations that can be drawn from the figure. First, there are as many
matchlines as the issue width. Hence, increasing issue width increases the height of each
CAM row. Second, increasing issue width also increases the number of inputs to the OR
block.
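The wakeup operation described in this section can be summarized with a behavioral sketch. It models function only, not circuit timing; the class and function names are hypothetical, and the 7-bit tag width is an assumption chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class WindowEntry:
    """One issue-window entry: two operand tags and their ready flags."""
    tag_l: int
    tag_r: int
    rdy_l: bool = False
    rdy_r: bool = False

    @property
    def rdy(self):
        # The instruction is ready once both operands are available.
        return self.rdy_l and self.rdy_r


def matchline(operand_tag, result_tag, width):
    """Model one matchline: precharged high, pulled low by any mismatching bit."""
    line = True
    for bit in range(width):
        if ((operand_tag >> bit) & 1) != ((result_tag >> bit) & 1):
            line = False  # a pull-down stack discharges the matchline
    return line


def wakeup(window, result_tags, width=7):
    """Broadcast up to IW result tags to every entry and update rdyL/rdyR.

    Each entry performs 2 * IW comparisons; the OR of the match signals for
    each operand sets the corresponding ready flag.
    """
    for entry in window:
        if any(matchline(entry.tag_l, t, width) for t in result_tags):
            entry.rdy_l = True
        if any(matchline(entry.tag_r, t, width) for t in result_tags):
            entry.rdy_r = True


if __name__ == "__main__":
    window = [WindowEntry(3, 5), WindowEntry(5, 5)]
    wakeup(window, [5])
    print([e.rdy for e in window])
```

The two observations above show up in the sketch as the length of `result_tags` (one matchline per result tag) and the number of terms combined by `any` (the inputs to the OR block).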
2.4.2.2 Delay Analysis
Because the match lines are precharged high, the default value of the ready signal is
high. Hence, the delay of the critical path is the time it takes for a mismatch in a single bit
position to pull the ready signal low. The delay consists of three components: the time
taken by the buffers to drive the tag bits, the time taken for the pull-down stack corre-
sponding to the bit position with the mismatch to pull the match line low, and the time
taken to OR the individual match signals. Symbolically,
$$Delay = T_{tagdrive} + T_{tagmatch} + T_{matchOR}$$

Figure 2-12. CAM cell in wakeup logic. [Diagram omitted. Labels: RAM CELL, DATA, tag lines $TAG_1$ to $TAG_{IW}$, PRECHARGE, match lines MATCH1 to MATCHIW, PULL-DOWN STACK, PD1, PD2, A, OR, RDY.]
Each of the components is analyzed next.
Tag Drive Time
The tag drive circuit is shown in Figure 2-13. The time taken to drive the tags depends
on the length of the tag lines. The length of the tag lines is given by

$$TaglineLength = (camheight + IW \times matchline_{spacing}) \times WINSIZE$$

where camheight is the height of a single CAM cell excluding the matchlines, and
$matchline_{spacing}$ is the spacing between matchlines¹.

1. To be precise, $matchline_{spacing}$ is the height of a matchline and the associated pull-down stacks.

Figure 2-13. Tag drive structure.

From the equivalent circuit shown in the figure, the time taken to drive the tags is given
by

$$T_{tagdrive} = c_0 \times (R_{tagdriverpup} + R_{tlres}) \times C_{tlcap}$$

where $R_{tagdriverpup}$ is the resistance of the pull-up of the tag driver, $R_{tlres}$ is the metal
resistance of the tag line, and $C_{tlcap}$ is the total capacitance on the tag line. $R_{tlres}$ is
determined by the length of the tag lines:

$$R_{tlres} = 0.5 \times TaglineLength \times R_{metal}$$

$C_{tlcap}$ consists of three components: the metal capacitance determined by the length of the
tag line, the gate capacitances of the comparators, and the diffusion capacitance of the tag
driver. Symbolically,

$$C_{tlcap} = TaglineLength \times C_{metal} + C_{gatecapcomp} \times WINSIZE + C_{diffcap}$$

where $C_{gatecapcomp}$ is the gate capacitance of the pass transistor PD2 (shown in
Figure 2-13) in the comparator's pull-down stack and $C_{diffcap}$ is the diffusion capacitance
of the tag driver.
Substituting the above equations into the overall delay equation and simplifying we get

$$T_{tagdrive} = c_0 + (c_1 + c_2 \times IW) \times WINSIZE + (c_3 + c_4 \times IW + c_5 \times IW^2) \times WINSIZE^2$$

The above equation shows that the tag drive time increases with window size and issue
width. For a given issue width, the total delay is a quadratic function of the window size.
The weighting factor for the quadratic term is a function of the issue width. We found that
the weighting factor becomes significant for issue widths beyond 2. For a given window
size, the tag drive time is also a quadratic function of issue width. We found that for
current technologies (0.35µm and longer) the quadratic component is relatively small and the
tag drive time is largely a linear function of issue width. However, as the feature size is
reduced to 0.18µm the quadratic component also increases in significance. The quadratic
component results from the intrinsic RC delay of the tag lines. The constants in the
equation are listed in Table B.5 in Appendix B.
In reality, both issue width and window size will be simultaneously increased because a
larger window is required for finding more independent instructions. Hence, we believe
that the tag drive time can become significant in future designs with wider issue widths,
bigger windows, and smaller feature sizes.
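As a numerical check of the tag drive model, the sketch below evaluates the RC product form and confirms that it is quadratic in window size, with a quadratic weight that grows with issue width. All parameter values in `PARAMS` are hypothetical placeholders in arbitrary units, not the Appendix B constants, and the helper names are introduced here for illustration only.

```python
# Hypothetical technology parameters in arbitrary units; NOT the Appendix B values.
PARAMS = {
    "camheight": 1.0,
    "matchline_spacing": 0.25,
    "r_metal": 0.01,
    "c_metal": 0.02,
    "c_gatecapcomp": 0.5,
    "c_diffcap": 1.0,
    "r_tagdriverpup": 10.0,
    "c0": 1.0,
}


def tag_drive_time(winsize, iw, p=PARAMS):
    """Evaluate T_tagdrive = c0 * (R_tagdriverpup + R_tlres) * C_tlcap."""
    length = (p["camheight"] + iw * p["matchline_spacing"]) * winsize
    r_tlres = 0.5 * length * p["r_metal"]
    c_tlcap = length * p["c_metal"] + p["c_gatecapcomp"] * winsize + p["c_diffcap"]
    return p["c0"] * (p["r_tagdriverpup"] + r_tlres) * c_tlcap


def quadratic_weight(iw, p=PARAMS):
    """Coefficient of WINSIZE^2 obtained by expanding the product by hand."""
    k = p["camheight"] + iw * p["matchline_spacing"]  # tag line length per entry
    return p["c0"] * 0.5 * k * p["r_metal"] * (k * p["c_metal"] + p["c_gatecapcomp"])


def second_difference(iw, w, step=8):
    """f(w + 2h) - 2 f(w + h) + f(w); equals 2 * weight * h^2 for a quadratic."""
    return (tag_drive_time(w + 2 * step, iw)
            - 2 * tag_drive_time(w + step, iw)
            + tag_drive_time(w, iw))


for iw in (2, 4, 8):
    # The two printed numbers agree, and both grow with issue width.
    print(iw, second_difference(iw, 8), 2 * quadratic_weight(iw) * 8 ** 2)
```

The hand-expanded weight contains a term linear in the per-entry length and a term in its square, which is consistent with the $c_3 + c_4 \times IW + c_5 \times IW^2$ form of the quadratic coefficient.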
Tag Match time
The tag match time is the time taken for one of the pull-down stacks to pull the
matchline low. From the equivalent circuit shown in Figure 2-14,

$$T_{tagmatch} = c_0 \times (R_{pdstack} + R_{mlres}) \times C_{mlcap}$$

where $R_{pdstack}$ is the effective resistance of the pull-down stack, $R_{mlres}$ is the metal
resistance of the matchline, and $C_{mlcap}$ is the total capacitance on the match line.

Figure 2-14. Tag match structure.

$R_{mlres}$ can be computed using

$$MatchlineLength = (camwidth + IW \times tagline_{spacing}) \times PREG_{width}$$

$$R_{mlres} = 0.5 \times MatchlineLength \times R_{metal}$$

where MatchlineLength is the length of the matchlines, camwidth is the width of the CAM
cell excluding the tag lines, and $tagline_{spacing}$ is the spacing between tag lines.
$C_{mlcap}$ consists of three components: the diffusion capacitance of all the pull-down
stacks connected to the matchline, the metal capacitance of the matchline, and the gate
capacitance of the inverter at the end of the matchline. Hence,

$$C_{mlcap} = 2 \times PREG_{width} \times C_{diffcap} + MatchlineLength \times C_{metal} + C_{gatecap}$$

where $PREG_{width}$ is the width of the physical register designators, $C_{diffcap}$ is the diffusion
capacitance of the pass transistor (marked as PD1 in Figure 2-14) in the pull-down stacks
that is connected to the matchline, and $C_{gatecap}$ is the gate capacitance of the inverter at the
end of the match line.
Substituting the equations for $R_{mlres}$ and $C_{mlcap}$ into the overall delay equation and
simplifying we get

$$T_{tagmatch} = c_0 + c_1 \times IW + c_2 \times IW^2$$

Again, we found that the quadratic component is relatively small and hence the tag
match time is largely a linear function of issue width. The constants are listed in Table B.6 in
Appendix B.
A drawback of our model for the tag match time is that it does not model the dependence
of the match time on the slope of the tag line signal, i.e., the tag drive delay. Our results,
presented in the next section, show that, as a result of this dependence, the tag match time
is also a function of the window size. In other words, a larger window will result in slower
fanning out of the result tags to the comparators in the window entries, thus increasing the
compare time.
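The simplification step for the tag match time can be written out as follows. This is a sketch of the algebra only; the abbreviations $P$, $M$, and $k$ are introduced here for brevity (with $k$ standing for the leading constant of the RC product form, to avoid clashing with the $c_0$ of the simplified form) and do not appear in the original text.

```latex
% Abbreviations introduced for this sketch only:
%   P = PREG_{width},  M = MatchlineLength,  k = leading constant of the RC form
\begin{align*}
M &= (camwidth + IW \times tagline_{spacing}) \times P \\
T_{tagmatch} &= k \,(R_{pdstack} + 0.5\, M R_{metal})\,(2 P C_{diffcap} + M C_{metal} + C_{gatecap}) \\
 &= k\, R_{pdstack}\,(2 P C_{diffcap} + C_{gatecap}) \\
 &\quad + k\, M \,\bigl[\, R_{pdstack} C_{metal} + 0.5\, R_{metal}\,(2 P C_{diffcap} + C_{gatecap}) \,\bigr] \\
 &\quad + 0.5\, k\, R_{metal} C_{metal}\, M^2
\end{align*}
```

Because $M$ is linear in $IW$, only the last term contributes an $IW^2$ component. That term is the intrinsic RC delay of the matchline wire, which is why the quadratic coefficient is small whenever the wire resistance is small relative to the pull-down stack resistance.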
Match OR time
This is the time taken to OR the individual matchlines to produce the ready signal.
Because the number of matchlines is the same as the issue width, the magnitude of this
delay term is a direct function of issue width. Figure 2-15 shows the OR logic for issue
widths of 2, 4, and 8. For an issue width of 8, we use two 4-input NAND stacks followed
by a NOR gate because this is faster than using an 8-input NAND gate. Because the rise
delay of a gate is a linear function of the fan-in [WE93,Rab96] we can write the
delay as

$$T_{matchOR} = c_0 + c_1 \times IW$$

where the constants are as shown in Table B.7 in Appendix B.

Figure 2-15. Logic for ORing individual match signals (panels: issue width = 2, 4, and 8).
Overall delay
The overall delay of the wakeup logic can be summarized by the following equation:

$$Delay = (c_0 + c_1 \times IW + c_2 \times IW^2) + (c_3 + c_4 \times IW) \times WINSIZE + (c_5 + c_6 \times IW + c_7 \times IW^2) \times WINSIZE^2$$

where the constants are as tabulated in Table B.8 in Appendix B.
2.4.2.3 Spice Results
The graph in Figure 2-16 shows how the delay of the wakeup logic varies with window
size and issue width for 0.18µm technology. As expected, the delay increases as window
size and issue width are increased. The quadratic dependence of the total delay on the
window size results from the quadratic increase in tag drive time as discussed in the previous
section. This effect is clearly visible for an issue width of 8 and is less significant for smaller
issue widths. We found similar curves for 0.8µm and 0.35µm technologies. The quadratic
dependence of delay on window size was more prominent in the curves for 0.18µm
technology than for the other two technologies.

Figure 2-16. Wakeup logic delay versus window size. This graph shows how the delay of the
window wakeup logic varies with window size and issue width for 0.18µm technology.
(x-axis: window size, 8 to 64; y-axis: wakeup delay in ps; curves: 2-way, 4-way, 8-way.)
Also, issue width has a greater impact on the delay than window size because increasing
issue width increases all three components of the delay. On the other hand, increasing
window size only lengthens the tag drive time and, to a small extent, the tag match time.
Overall, the results show that the delay increases by almost 34% going from 2-way to
4-way and by 46% going from 4-way to 8-way for a window size of 64 instructions. In
reality, the increase in delay is going to be even worse because in order to sustain a wider issue
width, a larger window is required to find independent instructions. We found similar
curves for 0.8µm and 0.35µm technologies. Detailed results for various configurations and
technologies are shown in tabular form in Appendix A.
The bar graph on the left in Figure 2-17 shows the detailed breakdown of the total delay
for various window sizes for an 8-way processor in 0.18µm technology. The tag drive time
increases rapidly with window size. For example, the tag drive time and the tag match time
increase by factors of 4.78 and 1.33 respectively when the window size is increased from 8
to 64. The increase in tag drive time is higher than that of tag match time because the tag
drive time is a quadratic function of the window size. The increase in tag match time with
the window size is not taken into account by our simple model given above because the
model does not take into consideration the slope of the input signals (determined in this
case by the tag drive delay). Also, as shown by the graph, the time taken to OR the match
signals depends only on the issue width and is independent of the window size. The graph
on the right in Figure 2-17 shows how the delay of a 64-entry window in 0.18µm
technology varies with issue width. As shown by the delay analysis, all the components increase
with issue width.

Figure 2-17. Wakeup logic delay. The graph on the left shows how wakeup delay varies with
window size for an 8-way machine. The graph on the right shows how wakeup delay varies with
issue width for a 64-entry window. Both graphs are for 0.18µm technology.
(Left x-axis: window size, 8 to 64; right x-axis: issue width, 2, 4, 8; y-axis: wakeup delay in ps;
stacked components: tag drive, tag match, match OR.)
Figure 2-18 shows the effect of reducing feature sizes on the various components of the
wakeup delay for an 8-way, 64-entry window processor. The tag drive and tag match
delays do not scale as well as the match OR delay. This is expected because tag drive and
tag match delays include wire delays whereas the match OR delay only consists of logic
delays. Quantitatively, the fraction of the total delay contributed by tag drive and tag
match delay increases from 52% to 65% as the feature size is reduced from 0.8µm to
0.18µm. This shows that the performance of the broadcast operation will become more
critical in future technologies.

Figure 2-18. Wakeup delay versus feature size. This graph shows how the wakeup delay for an
8-way machine with a 64-entry window varies with feature size.
(x-axis: feature size, 0.8µm, 0.35µm, 0.18µm; y-axis: wakeup delay in ps; stacked components:
tag drive delay, tag match delay, match OR delay.)

In the above simulation results the window size was limited to a maximum of 64
instructions because we found that for larger windows the intrinsic RC delay of the tag lines
increases significantly. As discussed previously, the intrinsic RC delay is proportional to
the square of the window size. Therefore, banking should be used to implement larger
windows. Banking helps alleviate the intrinsic RC delay by reducing the length of
the tag lines. For example, two-way banking will improve the intrinsic RC delay by a
factor of four. At the same time it must be pointed out that banking will introduce some extra
delay due to extra inverter stages and the parasitics introduced by the extension to the tag
lines.
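The factor of four follows directly from the quadratic dependence of the intrinsic RC term on wire length. The short sketch below illustrates this; the per-entry resistance and capacitance values are hypothetical, and the extra inverter and parasitic delays of banking are deliberately left out.

```python
def intrinsic_rc(length, r_per_unit, c_per_unit):
    """Intrinsic wire delay term 0.5 * R_wire * C_wire, both proportional to length."""
    return 0.5 * (r_per_unit * length) * (c_per_unit * length)


# Hypothetical per-entry wire resistance and capacitance (arbitrary units).
R_PER_ENTRY, C_PER_ENTRY = 0.5, 2.0

full = intrinsic_rc(128, R_PER_ENTRY, C_PER_ENTRY)   # one tag line spanning 128 entries
banked = intrinsic_rc(64, R_PER_ENTRY, C_PER_ENTRY)  # tag line within one of two banks
ratio = full / banked
print(ratio)
```

Halving the tag line length halves both the wire resistance and the wire capacitance, so their product drops to one quarter regardless of the per-entry values chosen.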
2.4.2.4 Model Results
Figure 2-19 shows how the model results, computed using the constants in Appendix B,
compare to the Spice results. From the graph we can see that the model is successful in
tracking the dependence on issue width and window size.
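For readers who want to evaluate the wakeup delay model themselves, a minimal sketch is given below. The function name `wakeup_delay` and the constants in `EXAMPLE_CONSTANTS` are placeholders introduced here; the fitted constants are those of Table B.8 and are not reproduced.

```python
def wakeup_delay(iw, winsize, c):
    """Evaluate the overall wakeup delay model for constants c = (c0, ..., c7)."""
    c0, c1, c2, c3, c4, c5, c6, c7 = c
    return ((c0 + c1 * iw + c2 * iw ** 2)
            + (c3 + c4 * iw) * winsize
            + (c5 + c6 * iw + c7 * iw ** 2) * winsize ** 2)


# Hypothetical constants chosen only to exercise the formula (not Table B.8).
EXAMPLE_CONSTANTS = (1, 1, 1, 1, 1, 1, 1, 1)

for iw in (2, 4, 8):
    row = [wakeup_delay(iw, w, EXAMPLE_CONSTANTS) for w in (8, 16, 32, 64)]
    print(f"{iw}-way:", row)
```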
2.4.3 Window Selection Logic
Selection logic is responsible for selecting instructions for execution from the pool of
ready instructions in the issue window. Some form of selection logic is required for two
reasons. First, the number of ready instructions in the issue window can be greater than the
number of functional units. For example, for a machine with a 32-entry issue window
there could be as many as 32 ready instructions. Second, some instructions can be exe-
cuted only on a subset of the functional units. For example, if there is only one integer
multiplier, all multiply instructions will have to be steered to that functional unit.

Figure 2-19. Model delay results for wakeup logic. This graph shows how the model delay
results compare to the Spice results for 0.18µm technology.
(x-axis: window size, 8 to 64; y-axis: wakeup delay in ps; curves: Spice (S) and model (M)
results for 2-way, 4-way, and 8-way.)

The inputs to the selection logic are the request signals, termed REQ, one per instruction
in the issue window. The request signal of an instruction is raised when all the operands of
the instruction become available. As discussed in the previous section, the wakeup logic is
responsible for raising the REQ signals. The outputs of the selection logic are the grant
signals, termed GRANT, one per request signal. On receipt of the GRANT signal, the asso-
ciated instruction is issued to the functional unit and the corresponding window entry is
freed for later use. A selection policy is used to decide which of the requesting instructions
is granted the functional unit. We use a selection policy that is based on the location of the
instruction in the window. The HP PA-8000 [Kum96] uses a similar selection policy. We
chose this policy because it allows a simpler, and hence faster, implementation compared
to other more sophisticated policies like oldest ready first.
2.4.3.1 Structure
The assumed structure of the selection logic is shown in Figure 2-20. The selection logic
is used to select a single instruction for execution on a functional unit. The modifications
to this scheme for handling multiple functional units are discussed later. The selection logic
consists of a tree of arbiters. Each arbiter cell functions as follows. If the enable input is
high, then the grant signal corresponding to the highest priority, active input is raised. For
example, if enable = 1, req0 = 0, req1 = 1, req2 = 0, and req3 = 1, then grant1 will be
raised, assuming priority reduces as we go from input req0 to input req3. If the enable
input is low, all the grant signals are set to low. In all cases, at most one of the grant
signals is high. The anyreq output signal is raised if any of the input req signals is high.
The overall selection logic works in two phases. In the first phase, the request signals are
propagated up the tree. Each cell raises the anyreq signal if any of its input request signals
is high. This in turn raises the input request signal of its parent arbiter cell. Hence, at the
root cell one or more of the input request signals will be high if there are one or more
instructions that are ready. The root cell then grants the functional unit to one of its
children by raising one of its grant outputs. This initiates the second phase. In this phase, the
grant signal is propagated down the tree to the instruction that is selected. At each level,
the grant signal is propagated down the subtree that contains the selected instruction. The
enable signal to the root cell is high whenever the functional unit is ready to execute an
instruction. For example, for single-cycle ALUs, the enable signal will be permanently
tied to high.
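The two-phase operation can be sketched behaviorally as follows. This is a functional model only, assuming a window size that is a power of the arbiter fan-in; the function names `arbiter_cell` and `select` are introduced here for illustration and the sketch carries no timing information.

```python
def arbiter_cell(enable, reqs):
    """Return (grants, anyreq) for one arbiter cell; reqs[0] has highest priority."""
    grants = [False] * len(reqs)
    if enable:
        for i, req in enumerate(reqs):
            if req:
                grants[i] = True  # grant the highest-priority active request
                break
    return grants, any(reqs)


def select(reqs, enable=True, radix=4):
    """Select at most one requester using a tree of radix-input arbiter cells."""
    size = radix
    while size < len(reqs):
        size *= radix
    if size != len(reqs):
        raise ValueError("window size must be a power of the arbiter radix")

    # Phase 1: propagate anyreq signals up the tree, one list per level.
    levels = [list(reqs)]
    while len(levels[-1]) > radix:
        below = levels[-1]
        levels.append([arbiter_cell(False, below[i:i + radix])[1]
                       for i in range(0, len(below), radix)])

    # Phase 2: propagate grants down; a grant at one level enables a child cell.
    enables = [enable]
    for level in reversed(levels):
        grants = []
        for cell, cell_enable in enumerate(enables):
            cell_reqs = level[cell * radix:(cell + 1) * radix]
            grants.extend(arbiter_cell(cell_enable, cell_reqs)[0])
        enables = grants
    return enables


# The single-cell example from the text: enable = 1, req = 0, 1, 0, 1 raises grant1.
print(arbiter_cell(True, [False, True, False, True]))
```

For a 16-entry window with requests at positions 6 and 9, `select` grants position 6, the leftmost requester, which matches the position-based priority described above.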
The selection policy implemented by our assumed structure is static and is strictly based
on the location of the instruction in the window. The leftmost entries in the window have the
highest priority. The oldest ready first policy can be implemented using our scheme by
compacting the issue window to the left every time instructions are issued and by inserting
new instructions at the right end. This ensures that instructions that occur earlier in
program order occupy the leftmost entries in the window and hence have higher priority than
later instructions. However, it is possible that the complexity resulting from compaction
could degrade performance. We did not analyze the complexity of compacting in this
study.

Figure 2-20. Selection logic. This figure shows the arbiter tree of the selection logic and a
single arbiter cell in detail.

Handling Multiple Functional Units
If there are multiple functional units of the same type, then the selection logic (shown in
Figure 2-21) comprises a number of blocks of the type studied in the previous section,
stacked in series. The request signals to each block are derived from the requests to the
previous block by masking the request that was granted the previous resource.
An alternative to the above scheme is to extend the arbiter cells so that the request and
grant signals encode the number of resources being requested and granted respectively.
However, we believe that this could considerably slow down the arbiter cells and hence
could perform worse than the stacked design. The stacked design might not be a feasible
alternative beyond two functional units because the resulting delay can be significant.
Another option is to statically partition the window entries among the functional units.
For example, in the MIPS R10000 [Yea96], the window is partitioned into three sets called
the integer queue, floating-point queue, and the address queue. Only instructions in the
integer queue are monitored for execution on the two integer functional units.
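The stacked scheme can be illustrated with a short behavioral sketch. The names `select_one` and `stacked_select` are assumptions introduced here, and each selection block is reduced to a simple leftmost-first search rather than an arbiter tree.

```python
def select_one(reqs):
    """Grant the leftmost request (static, position-based priority)."""
    grants = [False] * len(reqs)
    for i, req in enumerate(reqs):
        if req:
            grants[i] = True
            break
    return grants


def stacked_select(reqs, num_units):
    """Chain one selection block per functional unit, masking earlier grants."""
    remaining = list(reqs)
    grants_per_unit = []
    for _ in range(num_units):
        grants = select_one(remaining)
        grants_per_unit.append(grants)
        # Mask the request that was just granted before feeding the next block.
        remaining = [r and not g for r, g in zip(remaining, grants)]
    return grants_per_unit


# Three ready instructions, two functional units of the same type.
print(stacked_select([False, True, True, False, True], num_units=2))
```

Because each block can only start once the previous block has produced its grants, the blocks operate in series, which is why the delay of this scheme grows with the number of functional units.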

Figure 2-21. Handling multiple functional units.

2.4.3.2 Delay Analysis
The delay of the selection logic is the time it takes to generate the grant signal after the
request signal has been raised. This is equal to the sum of two terms: the time taken for the
request signal to propagate to the root of the tree and the time taken for the grant signal to
propagate from the root to the selected instruction. Symbolically,

$$Delay = (L - 1) \times T_{reqpropd} + T_{root} + (L - 1) \times T_{grantpropd}$$

where $L = \log_4(WINSIZE)$ is the height of the selection tree, $T_{reqpropd}$ is the time taken for
the request signal to propagate through an arbiter cell, $T_{root}$ is the delay of the grant output
at the root cell, and $T_{grantpropd}$ is the time taken for the grant signal to propagate through
an arbiter cell. Hence, the overall delay can be written as

$$Delay = c_0 + c_1 \times \log_4(WINSIZE)$$

where $c_0$ and $c_1$ are constants as listed in Table B.9 in Appendix B. The base of the
logarithmic term is determined by the number of inputs to the arbiter. We found the optimal
number of arbiter inputs to be four in our case. The associated trade-offs are discussed
later.
From the above equations we can see that the delay of the selection logic is proportional
to the height of the tree and the delay of the arbiter cells. The delay has a logarithmic
relationship with the window size. Increasing issue width can also increase the selection delay
if a stacked scheme is used to handle multiple functional units. For the rest of the
discussion, we will assume that a single functional unit is being scheduled and hence no stacking
is used. The delay for a stacked design can be easily computed by multiplying our delay
results by the stacking depth. One way to improve the delay of the selection logic is to
increase the radix of the selection tree. However, as we will see shortly, this increases the
delay of a single arbiter cell and could make the overall delay worse.
Arbiter Logic
The circuit for generating the anyreq signal is shown in Figure 2-22. The anyreq signal
is raised if one or more of the input request signals is active. The circuit, implementing the
OR function, consists of a dynamic NOR gate followed by an inverter. The dynamic gate
was chosen instead of a static OR gate for speed reasons. The circuit operates as follows.
The anyreq node is precharged high. When one or more of the input request signals go
high, the corresponding pull-downs pull the anyreq node low. The inverter in turn raises
the anyreq signal high. The value of $T_{reqpropd}$ in the delay equation is the delay of the OR
circuit.
The priority encoder in the arbiter cell is responsible for generating the grant signals.
The logic equations for the grant signals are:

$$grant0 = req0 \cdot enable$$
$$grant1 = \overline{req0} \cdot req1 \cdot enable$$
$$grant2 = \overline{req0} \cdot \overline{req1} \cdot req2 \cdot enable$$
$$grant3 = \overline{req0} \cdot \overline{req1} \cdot \overline{req2} \cdot req3 \cdot enable$$

For example, grant2 is high only if the cell is enabled, the input requests req0 and req1
are low, and req2 is high. Because the request signals at each cell, except for the root, are
available well in advance of the enable signal, we use a two-level implementation for
evaluating the grant signals. As an example, the circuit for evaluating grant1 is shown in
Figure 2-22. The first stage evaluates the grant1 signal (node grant1p) assuming the
enable signal is high. In the second stage, the grant1p signal is ANDed with the enable to
produce the grant1 signal. This two-level decomposition was chosen because it removes
the logic for grant1p from the critical path. This optimization does not apply at the root
cell because at the root cell the request signals arrive after the enable signal.
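The equivalence of the direct and two-level evaluations can be checked exhaustively with a small sketch. The function names are introduced here for illustration, and the precomputed terms other than grant1p are named by analogy; the sketch checks logical equivalence only and does not model arrival times.

```python
from itertools import product


def grants_direct(enable, req0, req1, req2, req3):
    """Grant equations evaluated directly, with enable in every product term."""
    return (
        req0 and enable,
        (not req0) and req1 and enable,
        (not req0) and (not req1) and req2 and enable,
        (not req0) and (not req1) and (not req2) and req3 and enable,
    )


def grants_two_level(enable, req0, req1, req2, req3):
    """Two-level evaluation: precompute priority terms, then AND with enable."""
    # Stage 1: priority terms computed from the requests alone (e.g. grant1p).
    precomputed = (
        req0,
        (not req0) and req1,
        (not req0) and (not req1) and req2,
        (not req0) and (not req1) and (not req2) and req3,
    )
    # Stage 2: the late-arriving enable gates each precomputed term.
    return tuple(term and enable for term in precomputed)


# Collect every input combination on which the two forms disagree (expected: none).
mismatches = [inputs for inputs in product([False, True], repeat=5)
              if grants_direct(*inputs) != grants_two_level(*inputs)]
print(len(mismatches))
```

Only the final AND with enable depends on the late signal, so the priority terms can settle while the enable is still propagating down the tree.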
The policy used by the selection logic is embedded in the above equations for the grant
outputs of the arbiter cell. For example, the design presented assumes static priority with
req0 having the highest priority. Implementing an alternative policy would require
appropriate modifications to these equations. Again, the designer has to be careful while
selecting a policy because a complex policy can increase the delay of the selection logic by
slowing down individual arbiter cells.
Increasing the number of inputs to the arbiter cell slows down both the OR logic and the
priority encoder logic. The OR logic slows down because the load capacitance contributed
by the diffusion capacitance of the pull-downs increases linearly with the number of
inputs. The priority logic slows down because the delay of the logic used to compute
priority increases due to higher fan-in. We found the optimal number of inputs to be four in
our case. The selection logic in the MIPS R10000, described in [V+96], is also based on
four-input arbiter cells.

Figure 2-22. Arbiter Logic. The block on top shows the logic for the anyreq signal. The bottom
block shows the logic for generating the grant1 signal.

2.4.3.3 Spice Results
Figure 2-23 shows the delay of the selection logic for various window sizes in the three
technologies assuming a single functional unit is being scheduled. The delay is broken
down into the three components discussed earlier. From the graph we can see that for all
three technologies, the delay increases logarithmically with window size. Also, the
increase in delay is less than 100% when the window size is increased from 16 instruc-
tions to 32 instructions (or from 64 instructions to 128 instructions) because the middle
term in the delay equation, the delay at the root cell, is independent of the window size.
Detailed results are presented in tabular form in Appendix A.
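The sub-100% growth can be illustrated by evaluating the delay equation directly. In the sketch below the per-cell delays (100, 150, and 100 ps) are hypothetical round numbers, not Spice results, and the tree height is taken as the number of four-input arbiter levels needed to cover the window, which is an assumption for window sizes that are not powers of four.

```python
def tree_height(winsize, radix=4):
    """Number of arbiter levels needed to cover winsize entries."""
    levels, span = 1, radix
    while span < winsize:
        span *= radix
        levels += 1
    return levels


def selection_delay(winsize, t_reqpropd, t_root, t_grantpropd, radix=4):
    """Delay = (L - 1) * T_reqpropd + T_root + (L - 1) * T_grantpropd."""
    height = tree_height(winsize, radix)
    return (height - 1) * t_reqpropd + t_root + (height - 1) * t_grantpropd


# Hypothetical per-cell delays in picoseconds, for illustration only.
delays = {w: selection_delay(w, 100, 150, 100) for w in (16, 32, 64, 128)}
for small, large in ((16, 32), (64, 128)):
    growth = 100.0 * (delays[large] - delays[small]) / delays[small]
    print(f"{small} -> {large}: {delays[small]} ps -> {delays[large]} ps (+{growth:.0f}%)")
```

Adding a level adds one request propagation delay and one grant propagation delay, while the root delay stays fixed, so the relative increase is below 100% whenever the root delay is nonzero.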
The various components of the total delay scale well as the feature size is reduced. This
is not surprising because all the delays are logic delays. It must be pointed out that the
selection delays presented here are optimistic because we do not consider the wires in the
circuit, especially if it is the case that the request signals originate from the CAM entries
in which the instructions reside. On the other hand, it might be possible to minimize the
effect of these wire delays if the ready signals are stored in a smaller, more compact array.

Figure 2-23. Selection delay versus window size. This graph shows how the selection delay
varies with window size for the three different feature sizes. The selection policy used is based on
the location of the instruction in the window.
(x-axis: window sizes 16, 32, 64, 128 for each of 0.8µm, 0.35µm, 0.18µm; y-axis: selection delay
in ps; stacked components: request propagation delay, root delay, grant propagation delay.)

2.4.3.4 Model Results
Figure 2-24 shows how the model delay results, computed using the constants listed in
Appendix B, compare to the Spice results. The significant difference, especially for 0.8µm
technology, results because our delay models are unable to accurately model dynamic
logic.

Figure 2-24. Model delay results for selection logic. This graph shows how the model delay
results compare to the Spice results for selection logic.
(x-axis: window sizes 16, 32, 64, 128 for each of 0.8µm, 0.35µm, 0.18µm; y-axis: selection delay
in ps; bars: Spice delay, model delay.)

2.4.4 Register File Logic
The register file provides low latency access to register operands. The access time of the
register file depends on the number of registers in the file and the number of ports into the
file. Assuming two read operands and one write operand per instruction, the number of
read and write ports required for a machine with issue width IW is $2 \times IW$ and $IW$
respectively¹. The number of registers required increases with issue width in order to support a
greater degree of speculative execution. A recent study [FJC96] shows that for significant
performance up to 80 physical registers are required for a 4-wide issue machine and up to
120 physical registers are required for an 8-wide issue machine.

1. In most machine designs additional write ports are implemented for write-back of load data.
2.4.4.1 Structure
The structure of the register file assumed for this study is similar to that of the map table
shown in Figure 2-5 on page 29. The register file contents are stored in the cross-coupled
inverters in the cells. Each row of cells stores the contents of a single register. Hence, the
number of rows is determined by the number of registers in the file. The number of cells in
each row is determined by the datapath width. We assume a 64-bit datapath for this study.
A read operation starts with the register number (physical) being applied to the decoder.
The decoder decodes the register number and raises one of the wordlines. This triggers
bitline changes which are sensed by a sense amplifier and the appropriate output is generated.
We use precharged, double-ended bitlines to improve the speed of read operations.
Read ports are implemented using NAND stacks (two pass gates in series) instead of a
single pass gate to prevent flipping of cell contents during a read operation, especially for
configurations with a large number of read ports.
There are a few differences between the map table in the register rename logic and the
register file. The shift register component of the map table is not present in the register file.
In the case of the rename logic, the number of rows is determined by the number of logical
registers in the instruction set architecture. The number of rows in the register file is
determined by the number of physical registers. The width of each row in the map table is
determined by the width of the physical register tags. In the case of the register file, the width
of each row is determined by the datapath width, which is 64 bits in most current designs.
2.4.4.2 Delay Analysis
The critical path for the register file logic is the time it takes for the contents of the
register to be output after the register number is applied to the address decoder. The delay of the
critical path consists of three components: the time taken to decode the register number,
the time taken to drive the wordline, and the time taken by an access stack to pull the
bitline low and for the sense amplifier to detect the change in the bitline and produce the
corresponding output. Hence, the overall delay is given by

$$T_{delay} = T_{decode} + T_{wordline} + T_{bitline}$$

Each of the components is analyzed next. The analysis presented here is similar to that
presented for the rename logic. Hence, figures are omitted and the discussion is kept brief.
Decoder delay
We use the same predecoding scheme as used in the map table of the rename logic, as
shown in Figure 2-6 on page 31. The fan-in of the NAND and NOR gates is determined by
the number of bits in the register number, i.e., by the number of physical registers. Table 2.2
shows the fan-in of the decoder gates for the various register file sizes simulated.

Table 2.2: Fan-in of decoder gates.
Number of            Fan-in of          Fan-in of direct
physical registers   predecode gates    decode gates
32                   2                  3
64                   2                  3
128                  3                  3
256                  4                  2
512                  4                  3

The output of the NAND gates is connected to the input of the NOR gates by the
predecode lines. The length of these lines is given by

$$PredeclineLength = 0.5 \times (cellheight + 3 \times IW \times wordline_{spacing}) \times NPREG$$

where cellheight is the height of a single cell excluding the wordlines, IW is the issue
width, $wordline_{spacing}$ is the spacing between the wordlines, and NPREG is the number of
physical registers. The factor 3 in the equation results from the assumption of 3-operand
instructions (2 read operands and 1 write operand). With these assumptions, 3 ports (1
write port and 2 read ports) are required per cell for each instruction being renamed.
Hence, for an IW-wide issue machine, a total of $3 \times IW$ wordlines are required for each
cell. The factor 0.5 results from the assumption that the predecode NAND gates drive the
predecode lines from the centre of the array. This optimization was necessary to minimize
the RC effects of long predecode lines for large, highly ported configurations.
The decoder delay is the time it takes to decode the register number, i.e., the time it takes
for the output of the NOR gate to rise after the input to the NAND gate has been applied.
Hence, the decoder delay can be written as

$$T_{decode} = T_{nand} + T_{nor}$$

where $T_{nand}$ is the delay of the NAND gate and $T_{nor}$ is the delay of the NOR gate. $T_{nand}$ is
given by the following equations,

$$T_{nand} = c_0 \times R_{eq} \times C_{eq}$$

$$R_{eq} = R_{nandpd} + 0.5 \times PredeclineLength \times R_{metal}$$

$$C_{eq} = C_{diffcapnand} + C_{gatecapnor} + PredeclineLength \times C_{metal}$$

where $R_{nandpd}$ is the pull-down resistance of the NAND gate, $C_{diffcapnand}$ is the diffusion
capacitance at the output of the NAND gates, and $C_{gatecapnor}$ is the gate capacitance of the
NOR gates.
Substituting the above equations into the overall decoder delay and simplifying, we get

$$T_{decode} = c_0 + (c_1 + c_2 \times IW) \times NPREG + (c_3 + c_4 \times IW + c_5 \times IW^2) \times NPREG^2$$

The above equation shows that the decode time increases with the number of physical
registers and the issue width. For a given issue width, the total delay is a quadratic
function of register file size. The weighting factor for the quadratic term is a function of the
issue width. For a given register file size, the decode time is also a quadratic function of
issue width. The quadratic components in both cases result from the intrinsic RC delay of
the predecode lines and are small relative to the other components. Typical values of the
constants in the equation are listed in Table B.10 in Appendix B.
Wordline Delay
The wordline delay is defined as the time taken to turn on all the access transistors connected
to the wordline after the register number has been decoded. The wordline delay is
the sum of the fall delay of the wlinv inverter and the rise delay of the wordline driver. The
delay of the wordline driver is given by the following equations

\[ \mathit{WordlineLength} = (\mathit{cellwidth} + 6 \times IW \times \mathit{bitline}_{spacing}) \times DATA_{width} \]
\[ R_{wl} = 0.5 \times \mathit{WordlineLength} \times R_{metal} \]
\[ C_{wl} = DATA_{width} \times C_{gatecap} + \mathit{WordlineLength} \times C_{metal} \]
\[ T_{wldriver} = c_0 \times (R_{wldriver} + R_{wl}) \times C_{wl} \]

where $R_{wl}$ is the resistance of the wordline wire, $C_{wl}$ is the capacitance on the wordline,
$R_{wldriver}$ is the pull-up resistance of the wordline driver, and $C_{gatecap}$ is the gate
capacitance of the access transistor.
Factoring the above equations into the wordline delay equation and simplifying we get

\[ T_{wordline} = c_0 + c_1 \, IW + c_2 \, IW^2 \]

where $c_0$, $c_1$, and $c_2$ are constants listed in Table B.11 in Appendix B. Again, the quadratic
component results from the intrinsic RC delay of the wordline wire and we found that this
component is very small relative to other components. Hence, the overall wordline delay is
linearly dependent on the issue width.
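The origin of the linear and quadratic terms can be seen by expanding the wordline driver
equations symbolically. The following is only a sketch: the shorthand symbols $L$, $a$, and $b$
are introduced here for illustration and do not appear in the original derivation.

```latex
\begin{align*}
L &= \mathit{WordlineLength} = a + b\,IW,
  \qquad a = \mathit{cellwidth}\cdot DATA_{width},
  \qquad b = 6\,\mathit{bitline}_{spacing}\cdot DATA_{width} \\
T_{wldriver} &= c_0\,(R_{wldriver} + 0.5\,L\,R_{metal})\,(DATA_{width}\,C_{gatecap} + L\,C_{metal}) \\
 &= c_0\,\bigl[\,R_{wldriver}\,DATA_{width}\,C_{gatecap}
   + (R_{wldriver}\,C_{metal} + 0.5\,R_{metal}\,DATA_{width}\,C_{gatecap})\,L
   + 0.5\,R_{metal}\,C_{metal}\,L^{2}\,\bigr]
\end{align*}
```

Because $L$ is linear in $IW$, the $L^2$ term is the only source of an $IW^2$ dependence, and its
coefficient contains only the wire parameters $R_{metal}$ and $C_{metal}$. This is consistent with the
statement that the quadratic component comes from the intrinsic RC delay of the wordline wire.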
Bitline delay
The bitline delay is defined as the time between the wordline going high (turning on the
access transistor N1) and the output of the sense amplifier going high/low. This is the sum
of the time it takes for one access stack to discharge the bitline and the time it takes for a
sense amplifier to detect the discharge. Hence,

\[ T_{bitline} = T_{bitdischarge} + T_{senseamp} \]

The time taken to discharge the bitlines is determined by the following equations.

\[ \mathit{BitlineLength} = (\mathit{cellheight} + 3 \times IW \times \mathit{wordline}_{spacing}) \times NPREG \]
\[ R_{bl} = 0.5 \times \mathit{BitlineLength} \times R_{metal} \]
\[ C_{bl} = NPREG \times C_{diffcap} + \mathit{BitlineLength} \times C_{metal} \]
\[ T_{bitdischarge} = c_0 \times (R_{astack} + R_{bl}) \times C_{bl} \]

where $R_{astack}$ is the effective resistance of the access stack (two pass transistors in series),
$R_{bl}$ is the resistance of the bitline, $C_{bl}$ is the capacitance on the bitline, NPREG is the
number of physical registers, $C_{diffcap}$ is the diffusion capacitance of the access stack that
connects to the bitline, cellheight is the height of a single RAM cell excluding the wordlines,
and $\mathit{wordline}_{spacing}$ is the spacing between wordlines.
Factoring the above equations into the overall delay equation and simplifying we get

\[ T_{bitline} = c_0 + (c_1 + c_2 \, IW) \, NPREG + (c_3 + c_4 \, IW + c_5 \, IW^2) \, NPREG^2 \]

The bitline delay shows a similar dependence on issue width and register file size as the
decoder delay. The quadratic components result from the intrinsic RC delay of the bitline
wire. Again, we found that the quadratic component is very small relative to the other
components. Typical values for the constants are listed in Table B.12 in Appendix B.
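As a rough illustration of how the bitline discharge equations behave, the sketch below evaluates
them numerically. All parameter values are placeholders chosen for illustration (they are not the
constants of Table B.12), the names `BitlineParams` and `bitline_discharge_delay` are hypothetical,
and the sense amplifier delay is left out.

```python
from dataclasses import dataclass


@dataclass
class BitlineParams:
    """Placeholder technology parameters (assumed values, not from the thesis)."""
    c0: float = 0.69               # RC-to-delay constant (assumed)
    r_astack: float = 5.0e3        # access stack resistance (assumed)
    r_metal: float = 0.1           # wire resistance per unit length (assumed)
    c_metal: float = 0.2e-15       # wire capacitance per unit length (assumed)
    c_diffcap: float = 2.0e-15     # diffusion capacitance per cell (assumed)
    cellheight: float = 40.0       # cell height excluding wordlines (assumed)
    wordline_spacing: float = 4.0  # spacing between wordlines (assumed)


def bitline_discharge_delay(iw: int, npreg: int, p: BitlineParams) -> float:
    """Evaluate T_bitdischarge = c0 * (R_astack + R_bl) * C_bl."""
    bitline_length = (p.cellheight + 3 * iw * p.wordline_spacing) * npreg
    r_bl = 0.5 * bitline_length * p.r_metal
    c_bl = npreg * p.c_diffcap + bitline_length * p.c_metal
    return p.c0 * (p.r_astack + r_bl) * c_bl


params = BitlineParams()
for npreg in (80, 128, 256, 512):
    print(npreg, bitline_discharge_delay(4, npreg, params))
```

With these placeholder values the term linear in NPREG dominates, and the quadratic term only
becomes noticeable for large register files, which is the qualitative behavior described above.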
Overall delay
From the above analysis, the overall delay of the register file can be summarized by the
following equation:

\[ Delay = (c_0 + c_1 \, IW + c_2 \, IW^2) + (c_3 + c_4 \, IW) \, NPREG + (c_5 + c_6 \, IW + c_7 \, IW^2) \, NPREG^2 \]

where the constants are as tabulated in Table B.13 in Appendix B.
2.4.4.3 Spice Results
Figure 2-25 shows how the delay of the register file varies with the number of registers
and the issue width for the case of 0.18 µm technology. A number of observations can be
made from the graph. First, the delay increases as issue width and the number of registers
are increased. The graph also shows that the total delay is a linear function of the number
of registers. The dependence on issue width is also linear except for larger configurations
(512 registers or more) where the quadratic component starts to show. These observations
[Figure: clustered dependence-based microarchitecture. Labels: Rename, Steer, WINDOW (one per
cluster), register file and bypass (one per cluster), CLUSTER 0 through CLUSTER N-1, LOCAL
BYPASSES, INTER-CLUSTER BYPASSES.]
The front-end of the dependence-based superscalar microarchitecture is identical to that
of the conventional microarchitecture except for the addition of steering logic. The steer-
ing logic is responsible for steering instructions to individual clusters based on depen-
dences extracted at run-time. The goal of the steering logic is to make use of the full width
of the machine while minimizing the use of slow inter-cluster communication. Even
though the figure shows the steering logic to be in series with the rename logic, simple
versions of the steering logic can be implemented to operate in parallel with the rename
logic, thus eliminating the need for an extra pipestage. Section 3.3.3 discusses the trade-
offs involved in more detail.
Since the proposed microarchitecture uses the same front-end as a conventional
microarchitecture, it does not reduce the complexity of instruction fetch and renaming. Adding
extra pipestages, at the expense of a reduction in IPC as shown in Section 2.5 in Chapter 2, is one
way to reduce the complexity of the front-end.
Performance factors
The overall performance of a dependence-based microarchitecture is highly dependent
on the amount of ILP that can be extracted relative to the conventional microarchitecture.
If the microarchitecture can sustain comparable IPCs, then its clock speed advantage will
result in higher overall performance. The primary factors that determine the IPCs achieved
by the proposed microarchitecture are:
• Load balancing. It is important that instructions are spread out to use as many clusters
  as the amount of program parallelism allows. Otherwise, the program will not be able
  to take advantage of the full width of the machine. For example, if we have an 8-way
  dependence-based superscalar microarchitecture organized as 4 clusters each being
  2-wide, and if all instructions are steered to a single cluster, the machine will be
  effectively reduced to a 2-wide machine.
• Inter-cluster bypass frequency. Since inter-cluster communication is slow, excessively
  using the inter-cluster bypass paths can easily stretch the critical path of the program,
  resulting in poor performance. Hence, it is essential that the steering logic minimize
  the frequency of inter-cluster bypasses exercised. It must be pointed out that
  inter-cluster bypass frequency must be judged along with load balancing. For example, it is
  possible to completely eliminate inter-cluster communication by steering all instructions
  to a single cluster. However, performance can be significantly degraded because of the
  reduced effective width of the machine. Hence, the challenge is to be able to balance
  the load across multiple clusters while minimizing the frequency of inter-cluster
  bypasses.
• Steering logic complexity. Complex steering logic will require multiple pipestages that
  can result in IPC degradation due to an increase in the penalties associated with branch
  mispredicts and instruction-cache misses. This can reduce the benefit of achieving
  good load balance and minimizing inter-cluster bypass frequency. Hence, the steering
  logic must be kept simple.
The results presented in the rest of the chapter will show that it is possible to achieve
good steering with simple steering heuristics.
3.2 Dependence-based Microarchitectures: An Example
This section describes a particular dependence-based microarchitecture called the fifo-based
microarchitecture. The idea behind the fifo-based microarchitecture is to exploit the
natural dependences among instructions. A key point is that dependent instructions cannot
execute in parallel. In a single-cluster version of the proposed microarchitecture, shown in
Figure 3-2, the issue window is replaced by a small number of fifo buffers. The fifo buffers
are constrained to issue in-order, and dependent instructions are steered to the same fifo.
This ensures that instructions in a particular fifo buffer can only execute sequentially.
Hence, unlike the typical issue window where result tags have to be broadcast to all the
entries, the register availability only needs to be fanned out to the heads of the fifo buffers.
The instructions at the fifo heads monitor reservation bits (one per physical register) to
check for operand availability. This is discussed in detail later. Furthermore, the selection
logic only has to monitor instructions at the heads of the fifo buffers.
The steering of dependent instructions to the fifo buffers is performed at run-time during
the rename stage. Dependence information between instructions is maintained in a table
called the SRC_FIFO table. This table is indexed using logical register designators. For
example, SRC_FIFO[Ra], the entry for logical register Ra, stores the identity of the fifo
buffer containing the instruction that will write register Ra. If that instruction has already
completed i.e. register Ra contains its computed value, then SRC_FIFO[Ra] is invalid.
This table can be accessed in parallel with the rename table. In order to steer an instruction
to a particular fifo, the SRC_FIFO table is accessed with the register identifiers of the
source operands of an instruction. For example, to steer the instruction add r10,r5,1
where r10 is the destination register, the SRC_FIFO table is indexed with 5. The entry is
then used to steer the instruction to the appropriate fifo.
[Figure 3-2. Fifo-based microarchitecture. Pipeline stage labels: FETCH, RENAME, STEER, WAKEUP,
SELECT, REG READ, EXECUTE, BYPASS, DCACHE ACCESS, COMMIT. Blocks: fetch, rename, steer, fifos,
wakeup, select, register file, ALUs, bypass, data cache.]

A number of heuristics are possible for steering instructions to the fifos. A simple
heuristic that we found to work well for our benchmark programs is described next. Let I be
the instruction under consideration. Depending upon the availability of I's operands, the
following cases are possible:
1. All operands available. All the operands of I have already been computed and are
   residing in the register file. In this case, I is steered to a new (empty) fifo acquired from
   a pool of free fifos.
2. One outstanding operand. I requires a single outstanding operand to be produced by
   instruction $I_{source}$ residing in fifo $F_a$. In this case, if there is no instruction behind
   $I_{source}$ in $F_a$, then I is steered to $F_a$, else I is steered to a new fifo.
3. Two outstanding operands. I requires two outstanding operands to be produced by
   instructions $I_{left}$ and $I_{right}$ residing in fifos $F_a$ and $F_b$ respectively. In this case, apply the
   heuristic in the previous bullet to the left operand. If the resulting fifo is not suitable (it
   is either full or there is an instruction behind the source instruction), then apply the
   same heuristic to the right operand.
If all the fifos are full or if no empty fifo is available then the steering logic stalls. A fifo
is returned to the free pool when the last instruction in the fifo is issued. Initially, all the
fifos are in the free pool. Figure 3-3 illustrates the heuristic on a code segment from the
SPEC benchmark compress for a 4-wide machine. The listing on the left shows the
dynamic stream of instructions. The directed graph in the middle shows the register
dependences between those instructions. On the right side of the figure are the contents of
the fifos in each cycle. Instructions can issue only from the heads of the four fifos. The
steering logic steers four instructions every cycle and a maximum of four instructions can
issue every cycle. Consider the steering performed in cycle 1. Instructions 4, 5, 6, and 7
are steered to the appropriate fifos. Since instructions 4, 5, and 7 form a dependence chain,
they are steered to the same fifo. Because instruction 6 is a ready instruction (which
happens to start a dependence chain) it is steered to a new fifo. In the next cycle, instructions
8, 9, 10, and 11 are steered. Since instructions 8 and 9 form a chain that depends on
instruction 7, they are steered to the fifo containing instruction 7. Similarly, instructions 10
and 11 form a chain and are steered to a new fifo.

[Figure 3-3. Instruction steering example. Left: the instruction listing below. Middle: register
dependence graph over instructions 0 through 14. Right: fifo contents in successive cycles
(TIME axis), annotated with the issue groups "0,1,3 issue", "2,4,6 issue", "5,10 issue", and
"7,11,12 issue".]

     0: addu  $18,$0,$2
     1: addiu $2,$0,-1
     2: beq   $18,$2,L2
     3: lw    $4,-32768($28)
     4: sllv  $2,$18,$20
     5: xor   $16,$2,$19
     6: lw    $3,-32676($28)
     7: sll   $2,$16,0x2
     8: addu  $2,$2,$23
     9: lw    $2,0($2)
    10: sllv  $4,$18,$4
    11: addu  $17,$4,$19
    12: addiu $3,$3,1
    13: sw    $3,-32676($28)
    14: beq   $2,$17,L3
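The steering heuristic can be sketched in a few lines of Python. This is a minimal sketch
under stated assumptions, not the simulator's implementation: the class and method names are
hypothetical, only SRC_FIFO comes from the text, and a producer that has already issued but not
yet completed is treated as having no usable fifo (so its consumer goes to an empty fifo).

```python
from collections import deque


class FifoSteering:
    """Illustrative model of dependence-based steering to in-order fifos."""

    def __init__(self, num_fifos, depth):
        self.depth = depth
        self.fifos = [deque() for _ in range(num_fifos)]
        # SRC_FIFO: logical register -> (fifo index, id of the pending producer).
        # A missing key plays the role of an invalid entry (value already computed).
        self.src_fifo = {}

    def _producer_fifo(self, reg):
        """Fifo of reg's producer if the producer is at the tail and there is room."""
        entry = self.src_fifo.get(reg)
        if entry is None:
            return None
        fifo_index, producer = entry
        queue = self.fifos[fifo_index]
        if queue and queue[-1] == producer and len(queue) < self.depth:
            return fifo_index
        return None

    def steer(self, instr_id, dest, srcs):
        """Return the chosen fifo index, or None if the steering logic must stall."""
        target = None
        for reg in srcs:  # left operand first, then right operand
            target = self._producer_fifo(reg)
            if target is not None:
                break
        if target is None:  # no suitable producer fifo: acquire an empty fifo
            target = next((i for i, q in enumerate(self.fifos) if not q), None)
        if target is None:
            return None
        self.fifos[target].append(instr_id)
        if dest is not None:
            self.src_fifo[dest] = (target, instr_id)
        return target

    def issue(self, fifo_index):
        """Issue the head of a fifo; an emptied fifo is implicitly free again."""
        return self.fifos[fifo_index].popleft()

    def complete(self, instr_id):
        """Invalidate SRC_FIFO entries whose producer has produced its value."""
        for reg in [r for r, (_, p) in self.src_fifo.items() if p == instr_id]:
            del self.src_fifo[reg]


# First four instructions of the compress code segment on a machine with four fifos.
steering = FifoSteering(num_fifos=4, depth=8)
placed = [
    steering.steer(0, "$18", ["$0", "$2"]),
    steering.steer(1, "$2", ["$0"]),
    steering.steer(2, None, ["$18", "$2"]),
    steering.steer(3, "$4", ["$28"]),
]
print(placed)  # [0, 1, 0, 2]
```

In this sketch instructions 0 and 2 share a fifo because instruction 2 depends on instruction 0,
while instructions 1 and 3 each start a new fifo.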
3.2.1 Performance of the Fifo-based Microarchitecture
Comparison with window-based superscalar
We compare the performance of the fifo-based microarchitecture against that of a typical
microarchitecture with a single, large issue window. The proposed microarchitecture has 8
fifos, with each fifo having 8 entries. The issue window of the conventional processor has
64 entries. Both microarchitectures can decode, rename, and execute a maximum of 8
instructions every cycle. The simulation model assumed is detailed in Table 3.2 on
page 105.

[Figure 3-4. Performance of single-cluster fifo-based microarchitecture. Bar chart of
instructions per cycle for compress, gcc, go, ijpeg, m88ksim, perl, vortex, and li; series:
window-based and fifo-based.]

The performance results in terms of instructions committed per cycle are shown in
Figure 3-4. The fifo-based microarchitecture extracts similar parallelism as the typical
window-based microarchitecture. The cycle count numbers are within 5% for five of the
seven benchmarks and the maximum performance degradation is 8.7% in the case of perl.
Fifo utilization
The graph on the left in Figure 3-5 shows the time distribution of the number of active
fifos during the execution of m88ksim. A fifo is active if it contains at least one instruction.
While the graph shows that for a majority of the time all the fifos are utilized, there are
periods during which fewer fifos are active. This shows that the distribution of parallelism
in the program is uneven; there are phases in which the average number of parallel
chains is small. Other benchmarks show similar results.
The graph on the right in Figure 3-5 shows the time distribution of the depth of a particular
fifo during the execution of m88ksim. The graph shows that on average the number of
instructions in a fifo is small. This is for two reasons. First, the steering heuristic stalls
whenever a suitable fifo is not found. We found that placing the stalled instruction in a
random fifo could degrade performance for certain programs. Second, and more importantly,
frequent branch mispredicts cause breaks in the instruction stream presented to the
steering logic, resulting in shallow fifos on the average. We found similar distributions for the
other benchmarks.
Effect of increasing number of fifos
Increasing the number of fifos increased the performance for all the benchmarks. However,
the improvements were in the 2%-3% range for as many as 12 fifos. Eight fifos are
able to support most of the parallel chains found at any instance during the execution of
the programs.
3.2.2 Complexity Analysis of the Fifo-based Microarchitecture
First, consider the delay of the wakeup and selection logic. Wakeup logic is required to
detect cross-fifo dependences. For example, if the instruction $I_a$ at the head of fifo $F_a$ is
dependent on an instruction $I_b$ waiting in fifo $F_b$, then $I_a$ cannot issue until $I_b$ completes.
However, the wakeup logic does not involve broadcasting result tags to all the waiting
instructions. Instead, only the instructions at the fifo heads have to determine when all
their operands are available. This is accomplished by interrogating a table called the
reservation table. The reservation table contains a single bit per physical register that indicates
whether the register is waiting for its data. When an instruction is dispatched, the
reservation bit corresponding to its destination physical register is set. The bit is cleared when
the instruction executes and the result value is produced. An instruction at the fifo head waits
until the reservation bits corresponding to its operands are cleared. Hence, the delay of the
wakeup logic is determined by the delay of accessing the reservation table. The reservation
table is relatively small in size compared to the rename table and register file. For example,
for a 4-way machine with 80 physical registers, the reservation table can be laid out as a
10-entry table with each entry storing 8 bits. A column MUX is used to select the appropriate
bit from each entry. Table 3.1 shows the delay of the reservation table for 4-way and 8-way
machines. For both cases, the wakeup delay is much smaller than the wakeup delay
for a 4-way, 32-entry issue window-based microarchitecture. Also, this delay is smaller
than the corresponding register renaming delay. The selection logic in the dependence-based
microarchitecture using fifos is simple because only the instructions at the fifo heads
need to be considered for selection.

[Figure 3-5. Fifo utilization. The graph on the left shows the number of active fifos during the
execution of m88ksim. The graph on the right shows the depth of a particular fifo during the
execution of the program. Vertical axes: % total time. Horizontal axes: number of active fifos
(0 to 8) and fifo depth (1 to 8).]
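The reservation table protocol can be illustrated with a small Python sketch. The names
`ReservationTable`, `dispatch`, `writeback`, and `head_ready` are hypothetical, and the mapping of
a physical register number to a table entry and bit position (integer division and remainder) is
an assumption; the text only specifies the table shape and the set/clear behavior.

```python
class ReservationTable:
    """One reservation bit per physical register, stored as rows of bits."""

    def __init__(self, num_physical_regs, bits_per_entry=8):
        self.bits_per_entry = bits_per_entry
        num_entries = -(-num_physical_regs // bits_per_entry)  # ceiling division
        self.rows = [0] * num_entries

    def _locate(self, preg):
        # Row index selects the table entry; column index models the column MUX.
        return divmod(preg, self.bits_per_entry)

    def dispatch(self, dest_preg):
        """Set the bit: the register is now waiting for its data."""
        row, col = self._locate(dest_preg)
        self.rows[row] |= 1 << col

    def writeback(self, dest_preg):
        """Clear the bit: the result value has been produced."""
        row, col = self._locate(dest_preg)
        self.rows[row] &= ~(1 << col)

    def is_waiting(self, preg):
        row, col = self._locate(preg)
        return bool((self.rows[row] >> col) & 1)

    def head_ready(self, src_pregs):
        """A fifo-head instruction may issue once none of its operands is waiting."""
        return not any(self.is_waiting(p) for p in src_pregs)


table = ReservationTable(num_physical_regs=80)
table.dispatch(37)
print(len(table.rows), table.head_ready([37, 5]))  # 10 False
table.writeback(37)
print(table.head_ready([37, 5]))  # True
```

The wakeup check for a fifo head is therefore a handful of single-bit table reads rather than a
tag broadcast to every window entry.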
Instruction steering is done in parallel with register renaming. Because the SRC_FIFO
table is smaller than the rename table, we expect the delay of steering to be less than the
rename delay. In case a more complex steering heuristic is used, the extra delay can easily
be moved into the wakeup/select stage, or a new pipestage can be introduced at the cost
of an increase in the branch mispredict and instruction-cache miss penalties.
In summary, the complexity analysis presented above shows that by reducing the delay
of the window logic significantly, it is likely that the fifo-based microarchitecture can be
clocked faster than the typical microarchitecture. Combining the potential for a much
faster clock with the IPC results of Section 3.2.1 indicates that the dependence-based
microarchitecture is capable of superior performance relative to a conventional microarchitecture.
Issue width   # physical regs   # table entries   Bits/entry   Delay (ps)
     4              80                10              8          192.1
     8             128                16              8          251.7

Table 3.1: Delay of reservation table in 0.18 µm technology.
3.2.3 Clustering the Fifo-based Microarchitecture
The real advantage of the fifo-based microarchitecture is for building machines with
issue widths greater than four where, as shown in the previous chapter, the delay of both
the large window and the long bypass busses can be significant and can considerably slow
the clock. Dependence-based microarchitectures based on fifos are ideally suited for such
situations because they simplify both the window logic and the bypass logic as well as
naturally facilitate efficient steering. Such a microarchitecture for building an 8-way machine
is described next.
Consider the 2X4-way clustered system shown in Figure 3-6. Two clusters are used,
each of which contains four fifos, one copy of the register file, and four functional units.
Renamed instructions are steered to a fifo in one of the two clusters. Local bypasses
(shown using thick lines) permit same-cycle bypassing inside each cluster. Local
bypassing can be accomplished within a cycle. Inter-cluster bypasses, responsible for bypassing
values between functional units residing in different clusters, take one or more additional
cycles.

[Figure 3-6. Fifo-based microarchitecture with two clusters. Labels: Fetch, Decode, Rename,
Steer, FIFOS (one set per cluster), register file and bypass (one per cluster), CLUSTER 0,
CLUSTER 1, LOCAL BYPASSES, INTER-CLUSTER BYPASSES.]
This dependence-based microarchitecture using fifos has a number of advantages. First,
wakeup and selection logic are simplified as noted previously. Second, because of the
heuristic for assigning dependent instructions to fifos, and, indirectly, to clusters, local
bypasses are used much more frequently than inter-cluster bypasses, reducing overall
bypass delays.
3.2.4 Overall Performance of the Clustered Fifo-based Microarchitecture
The graph on the left in Figure 3-7 compares performance, in terms of instructions
committed per cycle (IPC), for the 2X4-way dependence-based microarchitecture against that
of a conventional 8-way microarchitecture with a single 64-entry issue window. For the
dependence-based microarchitecture, instructions are steered using the heuristic described
in Section 3.2. Local bypasses complete within a cycle while inter-cluster bypasses take 2
cycles. Also, in the conventional 8-way system all bypasses are assumed to complete in a
single cycle. From the graph we can see that for most of the benchmarks, the dependence-based
microarchitecture is nearly as effective as the window-based microarchitecture even
though the dependence-based microarchitecture is handicapped by slow inter-cluster
bypasses that take 2 cycles. However, for two of the benchmarks, m88ksim and ijpeg, the
performance degradation is close to 13%. We found that this degradation is mainly due to
extra latency introduced by the slow inter-cluster bypasses.

[Figure 3-7. Performance of the clustered fifo-based microarchitecture. Bar chart of
instructions per cycle for compress, gcc, go, ijpeg, li, m88ksim, perl, and vortex; series:
window-based and 2-cluster fifo-based.]
Because the dependence-based microarchitecture will facilitate a faster clock, a fair
performance comparison must take clock speed into account. The local bypass structure
within a cluster is equivalent to a conventional 4-way superscalar machine, and
inter-cluster bypasses are removed from the critical path by taking an extra clock cycle.
Consequently, the clock speed of the dependence-based microarchitecture is at least as fast as the
clock speed of a 4-way, 32-entry window-based microarchitecture, and is likely to be
significantly faster because of the smaller (wakeup + selection) delay compared to a
conventional issue window as discussed in Section 3.2.2. Hence, if $C_{dep}$ is the clock speed of the
dependence-based microarchitecture and $C_{win}$ is the clock speed of the window-based
microarchitecture then from Table A.10 in Appendix A for 0.18 µm technology,

\[ \frac{C_{dep}}{C_{win}} = \frac{\text{delay of 8-way 64-entry window}}{\text{delay of 4-way 32-entry window}} = 1.25 \]

In other words, the dependence-based microarchitecture is capable of supporting a clock
that is 25% faster than the clock of the window-based microarchitecture. Taking this factor
into account (and ignoring other pipestages that may have to be more deeply pipelined),
we can estimate the potential speedup with a dependence-based microarchitecture. The
speedups for the benchmarks are graphed in Figure 3-8. From the graph we can see that
the dependence-based microarchitecture is capable of providing superior overall
performance. The performance improvements vary from 9% to 21% with an average
improvement of 14%.

[Figure 3-8. Potential improvements with the fifo-based microarchitecture. Bar chart of
performance improvement (%) for compress, gcc, go, ijpeg, li, m88ksim, perl, and vortex.]
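The way the IPC ratio and the clock ratio combine can be shown with a short calculation. The
function names below are hypothetical, and the formula (overall speedup equals relative IPC
times relative clock speed) is the usual first-order estimate, assumed here to be the one behind
the reported speedups.

```python
def net_speedup(ipc_ratio, clock_ratio=1.25):
    """Overall speedup = (IPC_dep / IPC_win) * (C_dep / C_win)."""
    return ipc_ratio * clock_ratio


def improvement_percent(ipc_degradation_percent, clock_ratio=1.25):
    """Net performance improvement (%) for a given IPC degradation (%)."""
    ipc_ratio = 1.0 - ipc_degradation_percent / 100.0
    return (net_speedup(ipc_ratio, clock_ratio) - 1.0) * 100.0


# An IPC loss of about 13% still nets a gain of roughly 8.75% with a 25% faster clock.
print(round(improvement_percent(13.0), 2))
```

Under this estimate, the benchmarks with an IPC degradation close to 13% land near the low end
of the reported 9% to 21% range, and any benchmark with no IPC loss would be capped at 25%.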
Overall, our results show that the dependence-based microarchitecture using fifos is
capable of superior performance due to its ability to support a fast clock while extracting
significant levels of instruction-level parallelism.
3.2.5 Effect of Scaling Instruction and Data Cache Miss Latency
The clock advantage of the fifo-based microarchitecture could potentially increase cache
miss latencies (measured in clock cycles). In order to quantify this effect, we studied the
performance of the fifo-based microarchitecture when the cache miss latency is scaled by
the same amount as the clock speed improvement. For example, a cache miss that took 6
cycles to complete would now take 8 cycles (7.5 cycles to be precise) due to the 25%
improvement in clock speed.
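The latency scaling is simple arithmetic, sketched below. The function name is hypothetical, and
rounding up to a whole number of cycles is an assumption based on the 7.5-to-8 example in the
text.

```python
import math


def scaled_miss_latency(base_cycles, clock_ratio=1.25):
    """Miss latency in cycles of the faster clock, rounded up to whole cycles."""
    return math.ceil(base_cycles * clock_ratio)


print(scaled_miss_latency(6), scaled_miss_latency(12))  # 8 15
```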
Figure 3-9 graphs the results for base cache miss latencies of 6 cycles and 12 cycles.
These latencies translate to 8 and 15 cycles respectively when the 25% clock speed
advantage is taken into account. The win.Ncycles bars show the IPC for the window-based
superscalar with same-cycle bypassing between functional units assuming a cache miss
latency of N cycles. The fifo.Ncycles bars show the IPC for the 2-cluster fifo-based
microarchitecture assuming a cache miss latency of N cycles. From the graph, we can see
that the increase in cache miss latency due to clock speed improvement does not
significantly impact the performance of the fifo-based microarchitecture. The highest reduction
in IPC occurs for gcc: the performance reduction with respect to window-based
superscalar went up from 4.5% to 8.0% when the cache miss latency is increased from 6 cycles
to 8 cycles. The performance reductions are slightly higher when the base cache miss
latency is increased to 12 cycles. The primary reason why the IPCs achieved for both the
fifo-based microarchitecture and the window-based microarchitecture are not very
sensitive to the cache miss latency for most benchmarks is the low cache miss rates of the
benchmarks. The 32KB, 2-way L1 instruction and data caches are able to satisfy most of
the memory accesses.

[Figure 3-9. Effect of Scaling Instruction and Data Cache Miss Latency. Bar chart of
instructions per cycle for compress, gcc, go, ijpeg, li, perl, m88ksim, and vortex; series:
win.6cycles, fifo.8cycles, fifo.6cycles, win.12cycles, fifo.15cycles, fifo.12cycles.]
3.3 Other Dependence-based Microarchitectures
The microarchitecture presented in the previous section is one point in the design space
of dependence-based microarchitectures. The fifo-based microarchitecture simplifies
the window logic and naturally reduces the performance degradation due to slow
inter-cluster bypass paths. This section describes some other interesting points in the design
space. In each case there are multiple clusters with inter-cluster bypasses taking multiple
cycles to complete.
3.3.1 Single Window, Multiple Execution Clusters, Execution-driven Steering
In this design, shown in Figure 3-10, instructions reside in a central window while
waiting for their operands and functional units to become available. Instructions are assigned
to the clusters at the time they begin execution; this is execution-driven steering. With this
steering, cluster assignment works as follows. The register values in the clusters become
available at slightly different times, that is, the result register value produced by a cluster is
available in that cluster one cycle earlier than in the other cluster. Consequently, an
instruction waiting for the value may be enabled for execution a few cycles (equal to the
inter-cluster latency) earlier than in the other clusters. The selection logic monitors the
instructions in the window and attempts to assign them to the cluster which provides their
source values first (assuming there is a free functional unit in the cluster). Instructions that
have their source operands available in all clusters are considered for assignment in a
round-robin fashion starting with cluster 0. Static instruction order is used to break ties in
this case.
The execution-driven approach uses a greedy policy to minimize the use of slow inter-
cluster bypasses while maintaining a high utilization of the functional units. It does so by
postponing the assignment of ready instructions to clusters until execution time. While
this greedy approach might gain some IPC advantages, this design suffers from the previ-
ously discussed drawbacks of a central window and complex selection logic.
3.3.2 Multiple Windows, Dispatch-driven Steering
This design, shown in Figure 3-10, is identical to the fifo-based microarchitecture
presented in Section 3.2 except that each cluster has a completely flexible window instead of
fifos. Instructions are steered to the windows using a heuristic that takes both dependences
between instructions and the relative load of the clusters into account.

[Figure 3-10. Other dependence-based microarchitectures. Two panels: Execution-driven
Steering (labels: RENAMED INSTRUCTIONS, CLUSTER 0 through CLUSTER N-1) and Dispatch-driven
Steering (labels: RENAMED INSTRUCTIONS, STEERING LOGIC, CLUSTER 0 through CLUSTER N-1).]
Steering Policies
In the case of dependence-based superscalar microarchitectures based on multiple win-
dows with dispatch steering, we tried a number of steering heuristics. Three of these are
described next.
1. Fifo steering. In this scheme the window is modeled as if it is a collection of fifos with
   instructions capable of issuing from any slot within each individual fifo. The fifos are
   only a conceptual device used by the instruction assignment heuristic; in reality,
   instructions issue from the window with complete flexibility. Instructions are steered
   to the fifos using the heuristic presented in Section 3.2. For example, a 32-entry
   window can be treated as eight fifos with four slots each. An advantage of considering the
   windows as a collection of fifos is that it helps to keep the majority of the communication
   local and to achieve a good load balance at the same time.
2. Round-robin steering. In this scheme instructions in the dynamic stream are steered to
   clusters in a round-robin fashion with a particular block size. For example, for a block
   size of 16, the first 16 instructions are steered to cluster 0, the next 16 instructions are
   steered to cluster 1, and so on. The tacit assumption here is that dependences are
   localized in the dynamic stream as shown by previous studies on the distribution of ILP in
   programs [LW92,AS92]. In other words, instructions are dependent on other
   instructions that occur in close proximity (earlier) in the dynamic stream, i.e. independent
   instructions are well separated in the dynamic stream. An important parameter in this
   scheme is the block size. Using too small a block size can result in significant
   cross-cluster communication that can easily degrade performance by stretching the critical
   path. On the other hand using too big a block size can also degrade performance
   because now the number of functional units executing each block is a fraction of the
   total machine resources, i.e. low utilization might hurt performance. A compiler can
   assist this scheme by placing dependent instructions together. Studying the impact of
   instruction reordering by the compiler on the performance of this scheme is beyond
   the scope of this thesis.
3. Random steering. This steering heuristic is used as a basis for comparisons.
   Instructions are steered randomly to one of the clusters. If the window for the selected cluster
   is full, then the instruction is inserted into the other clusters in a round-robin fashion.
   This design point was evaluated in order to determine the degree to which
   dependence-based microarchitectures are capable of tolerating the extra latency introduced
   by slow inter-cluster bypasses and the importance of dependence-aware scheduling.
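The round-robin and random policies are simple enough to sketch directly. The function names
and the `window_free` representation are hypothetical; wrapping back to cluster 0 after the last
cluster is assumed from the phrase "and so on", and the fallback order for random steering is one
possible reading of "inserted into the other clusters in a round-robin fashion".

```python
import random


def round_robin_cluster(instr_index, block_size, num_clusters):
    """Cluster for the instr_index-th instruction of the dynamic stream."""
    return (instr_index // block_size) % num_clusters


def random_cluster(window_free, rng):
    """Pick a random cluster; if its window is full, try the others in order.

    window_free[i] is the number of free window entries in cluster i.
    Returns None when every window is full (the steering logic would stall).
    """
    n = len(window_free)
    first = rng.randrange(n)
    for offset in range(n):
        cluster = (first + offset) % n
        if window_free[cluster] > 0:
            return cluster
    return None


# Block size 16 on two clusters: instructions 0-15 -> 0, 16-31 -> 1, 32-47 -> 0.
print([round_robin_cluster(i, 16, 2) for i in (0, 15, 16, 31, 32)])  # [0, 0, 1, 1, 0]
print(random_cluster([0, 3], random.Random(0)))  # 1, the only cluster with free entries
```

Neither policy consults dependence information, which is why both can be computed without
reading a table indexed by register number.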
3.3.3 Complexity of Steering Policies
In addition to reducing inter-cluster communication and utilizing as many clusters as
possible, a good steering policy must also be fast. Low latency is essential since any extra
stages introduced in the front-end for steering can degrade performance (in terms of IPC)
due to increased branch mispredict and instruction cache miss penalties. This can even
nullify any advantages resulting from a faster clock. This section discusses the complexity
of the steering policies analyzed in this chapter.
Fifo steering. This steering policy can be implemented as shown in Figure 3-11. The
logic operates in parallel with the register rename logic. The number of entries in the
SRC_FIFO table is equal to the number of logical registers. The number of read ports
and write ports into the SRC_FIFO table is and respectively, where is
the issue width. Comparing the block diagram with the one for rename logic, shown in
Figure 2-3 on page 26, shows that the steering logic is functionally similar to the
Figure 3-11. Fifo steering hardware.
SRC_FIFO
TABLE
CHECK
CONFLICT
LOGIC
source
regs
logical
dest.
regs
logical
MUX
src fos
fo occupied
free
fos
fo ids
2 IW IW IW
104
rename logic. There are two differences. First, the SRC_FIFO table is smaller than the
rename map table as the width of each entry (determined by the number of fifos) is
smaller than the width of a rename map table entry. The second difference is that the
output MUX in the case of fifo steering is slightly more complicated than that for the
rename logic. Overall, the hardware complexity of fifo steering is similar to rename logic
complexity. Just as shown for rename logic in Chapter 2, the delay of the steering logic
increases linearly with issue width. Therefore, the fifo steering logic can almost
always be performed in parallel with renaming. In the worst case, it might require an
extra pipestage in addition to the rename stages.
Round-robin steering. Since this simply requires a counter that counts a block's worth
of instructions before incrementing the current cluster pointer, the logic for steering
is straightforward and can be accomplished in less time than the rename logic delay.
Hence, steering in this case can be completely hidden behind renaming. Also, the
delay of the steering logic is independent of issue width.
Random steering. Just like in the case of round-robin steering, the logic required for
random steering is straightforward and can be accomplished in less time than the
rename logic delay. Hence, once again, steering can be completely hidden behind
renaming. The delay of the steering logic is independent of issue width.
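The round-robin and random policies described above can be sketched in a few lines of
Python. This is only an illustration of the counter-and-pointer idea: the class names,
the block size of 4, and the use of a software random number generator are assumptions
made for the example, not details of the hardware.

```python
import random


class RoundRobinSteer:
    """Steer blocks of `block_size` consecutive instructions to one cluster."""

    def __init__(self, num_clusters, block_size):
        self.num_clusters = num_clusters
        self.block_size = block_size      # assumed parameter, not fixed by the text
        self.count = 0                    # instructions steered in the current block
        self.cluster = 0                  # current cluster pointer

    def steer(self):
        target = self.cluster
        self.count += 1
        if self.count == self.block_size:
            # Block complete: advance the cluster pointer, wrapping around.
            self.count = 0
            self.cluster = (self.cluster + 1) % self.num_clusters
        return target


class RandomSteer:
    """Pick a cluster uniformly at random for every instruction."""

    def __init__(self, num_clusters, seed=0):
        self.num_clusters = num_clusters
        self.rng = random.Random(seed)    # software stand-in for a hardware source

    def steer(self):
        return self.rng.randrange(self.num_clusters)


rr = RoundRobinSteer(num_clusters=2, block_size=4)
rr_targets = [rr.steer() for _ in range(10)]
print(rr_targets)  # [0, 0, 0, 0, 1, 1, 1, 1, 0, 0]

rnd = RandomSteer(num_clusters=4)
rnd_targets = [rnd.steer() for _ in range(10)]
```

Neither policy looks at register operands, which is consistent with the observation
that their delay does not depend on issue width.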
A natural question that arises in connection with instruction steering is: why can the
compiler not steer instructions? This question is especially pertinent given that the
compiler has complete knowledge of register dependences between instructions and this is
the critical information being used by the hardware to steer instructions. The key factor
that makes the compiler less effective than hardware is the inability of the compiler to
look beyond branches, i.e., to detect the dynamic sequence of dependences created at
run-time. Also, it is not obvious how the compiler can pass dependence information to the
underlying hardware without compromising binary compatibility.
3.4 Experimental Evaluation
This section evaluates the performance of various dependence-based superscalar
microarchitectures by measuring the performance of benchmark programs running on a
detailed timing simulator. The timing simulator, a modified version of SimpleScalar
[BAB96], is detailed in Table 3.2. All the configurations studied in this section are
8-wide, i.e., they can fetch, decode, rename, and execute a maximum of eight instructions
every cycle. An aggressive fetch mechanism is used to stress the issue and execution
subsystems. The benchmark programs are from the SPEC95 suite using their training
input datasets. Each program was run for a maximum of 0.5B instructions.
Fetch width                      any 8 instructions
I-Cache                          Perfect instruction cache
Branch predictor                 McFarling's gshare [McF93];
                                 4K 2-bit counters, 12 bit history;
                                 unconditional control instructions predicted correctly
Issue window size                64
Maximum in-flight instructions   120
Retire width                     16
Functional units                 8 symmetrical units
Functional unit latency          1 cycle
Issue mechanism                  out-of-order issue of up to 8 ops/cycle;
                                 loads may execute when all prior store addresses are known
Physical registers               120 int / 120 fp
D-Cache                          32KB, 2-way SA;
                                 write-back, write-allocate;
                                 32 byte lines, 1 cycle hit, 6 cycle miss;
                                 four load/store ports
Table 3.2: Baseline simulation model
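For readers unfamiliar with the branch predictor listed in Table 3.2, the following Python
sketch shows the general gshare scheme with the table's parameters (4K 2-bit counters,
12 bits of global history). It is a generic illustration and not the simulator's code:
the class name, the initial counter value, and the `pc >> 2` shift (which assumes
4-byte-aligned instructions) are assumptions.

```python
class Gshare:
    """Generic gshare: index a table of 2-bit counters with PC XOR global history."""

    def __init__(self, table_bits=12, history_bits=12):
        self.index_mask = (1 << table_bits) - 1
        self.history_mask = (1 << history_bits) - 1
        # 2-bit saturating counters, initialised to weakly not-taken (assumed).
        self.counters = [1] * (1 << table_bits)
        self.history = 0

    def _index(self, pc):
        # Drop the low PC bits (assumed 4-byte instructions), then XOR with history.
        return ((pc >> 2) ^ self.history) & self.index_mask

    def predict(self, pc):
        # Counter values 2 and 3 predict taken; 0 and 1 predict not taken.
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        # Shift the outcome into the global history register.
        self.history = ((self.history << 1) | int(taken)) & self.history_mask


predictor = Gshare()
for _ in range(50):                      # train on a branch that is always taken
    predictor.update(0x400100, True)
always_taken_prediction = predictor.predict(0x400100)
```

After the history register fills with taken outcomes, the branch maps to a single
counter that saturates, so the always-taken branch is predicted taken.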
Simulated microarchitectures
Table 3.3 lists the various types of microarchitectures simulated here. The typical
window-based microarchitecture, shown as the 1-cluster.1window configuration, assumes
uniform bypassing between all functional units within a single cycle, i.e., dependent
instructions can execute back-to-back. All the dependence-based microarchitectures
comprise two clusters with inter-cluster bypasses taking an extra cycle. The
2-cluster.1window.execsteer configuration is made up of two execution clusters each
containing half the execution resources of the machine. Renamed instructions are buffered
in a central window and routed to the execution clusters using the execution-driven
steering policy described in Section 3.3.1. In the 2-cluster.windows.randomsteer,
2-cluster.windows.fifosteer, and 2-cluster.windows.roundrobinsteer configurations, both
the window and execution resources are partitioned into two clusters and renamed
instructions are routed to the clusters using random steering, fifo steering, and
round-robin steering policies respectively. The 2-cluster.windows.randomsteer design
point was evaluated to determine the importance of dependence-aware scheduling. The
2-cluster.fifos.fifosteer configuration is identical to the 2-cluster.windows.fifosteer
configuration except that fifos are used in each cluster instead of a completely flexible
window.
Configuration              Window Organization   Steering Heuristic
window.execsteer           Flexible window       Execution steering
fifos.fifosteer            Fifos                 Fifo steering
windows.fifosteer          Flexible window       Fifo steering
windows.roundrobinsteer    Flexible window       Round-robin steering
windows.randomsteer        Flexible window       Random steering
Table 3.3: Various microarchitectures simulated.
3.4.1 Performance Relative to an Ideal Superscalar
The first set of experimental results, graphed in Figure 3-12, shows the performance of
various dependence-based superscalar microarchitectures relative to a typical
window-based microarchitecture in terms of instructions committed per cycle. A number of
observations can be made from the graph. First, random steering consistently performs
worse than the other schemes. The performance degradation with respect to the ideal case
(the 1-cluster.1window configuration) varies from 17% in the case of vortex to 23% in the
case of m88ksim. Hence, it is essential for the steering logic to consider dependences
when routing instructions. Second, the microarchitecture with a central window and
execution steering performs nearly as well as the ideal microarchitecture with a maximum
degradation of 3% in the case of m88ksim. However, as discussed earlier in Section 3.3.1,
this microarchitecture requires a centralized window with complex selection logic. Third,
the 2-cluster.fifos.fifosteer, 2-cluster.windows.fifosteer, and
2-cluster.windows.roundrobinsteer microarchitectures perform competitively in comparison
to the ideal microarchitecture. As expected, using completely flexible windows instead of
fifos helps improve performance slightly. Another way of interpreting this result is that
it reinforces the earlier finding that windows can be replaced with the combination of
fifos and intelligent steering with little degradation in IPC. An
Figure 3-12. Performance of dependence-based superscalar microarchitectures.
(Plot. Y-axis: instructions per cycle, 0.0 to 4.0. X-axis: compress, gcc, go, ijpeg, li,
m88ksim, perl, vortex. Series: 1-cluster.1window, 2-cluster.windows.fifosteer,
2-cluster.fifos.fifosteer, 2-cluster.windows.roundrobinsteer,
2-cluster.windows.randomsteer, 2-cluster.1window.execsteer.)
interesting supplementary result is that round-robin steering, which can be implemented
using simple logic, performs as well as the more complex fifo steering. However, as shown
later, round-robin steering does not scale well as the number of clusters is increased.
Overall, the above results show that dependence-based superscalar microarchitectures
can deliver performance similar, in terms of instructions committed per cycle, to that of an
ideal microarchitecture with a large window and uniform, single cycle bypasses between
all functional units.
3.4.2 Effect of Increasing Number of Clusters
The graph in Figure 3-13 shows the effect of increasing the number of clusters on the
performance of the fifos.fifosteer, windows.fifosteer, and windows.rrsteer
microarchitectures. Performance uniformly degrades for the three designs as the number of
clusters is increased. This is expected since increasing the number of clusters aggravates
load imbalance and results in more frequent inter-cluster communication. The performance
degradation going from 2 clusters to 4 clusters for the fifos.fifosteer and
windows.fifosteer microarchitectures is in the 5%-10% range. For the windows.rrsteer
microarchitecture, the performance degradation is in the 9%-17% range. For all the
benchmarks, the
Figure 3-13. Effect of increasing number of clusters.
(Plot. Y-axis: instructions per cycle, 0.0 to 4.0. X-axis: compress, gcc, m88ksim, vortex.
Series: 1-cluster.single-window, 2-cluster.fifos.fifosteer, 2-cluster.windows.fifosteer,
2-cluster.windows.rrsteer, 4-cluster.fifos.fifosteer, 4-cluster.windows.fifosteer,
4-cluster.windows.rrsteer.)
performance of the round-robin steering policy degrades more than that of the fifo
steering policy. This is mainly due to two reasons. First, the fifo steering policy does a
better job of exploiting the full width of the machine. For example, it can use all the
clusters cooperatively to execute a block of instructions. In the case of round-robin
steering, the block of instructions might be steered to a single cluster and hence, only
the resources in that cluster can be employed to execute the instructions, resulting in
lower throughput. The second reason for the superior performance of the fifo steering
policy is that it requires fewer inter-cluster bypasses as compared to the round-robin
steering heuristic. A simple example explains this. Consider the case where there are 4
clusters each 2-wide (2 functional units) and the dynamic stream is made up of two chains
(parallelism is equal to 2). In this situation, the fifo steering policy will only utilize
a single cluster since all instructions will be routed to the two fifos in the cluster.
This eliminates inter-cluster communication completely in this example. The round-robin
steering policy on the other hand, is oblivious of the parallelism in the instruction
stream, and uniformly steers instructions to all available clusters. Therefore, in this
case, inter-cluster communication is more frequent with the round-robin steering policy
than with the fifo steering policy.
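The two-chain example can be reproduced with a small Python sketch. It uses a deliberately
simplified steering rule (an instruction follows its producer if the producer is still at
the tail of its fifo, otherwise it takes the lowest-numbered empty fifo), assumes that
fifos never drain during the example, and counts cross-cluster dependence edges as a
stand-in for inter-cluster bypasses. The function names and the round-robin block size of
2 are assumptions made for illustration.

```python
def make_two_chains(length):
    """Interleave two independent dependence chains: a0, b0, a1, b1, ...
    Each entry is (name, producer_name or None)."""
    stream = []
    for i in range(length):
        for chain in ("a", "b"):
            producer = f"{chain}{i - 1}" if i > 0 else None
            stream.append((f"{chain}{i}", producer))
    return stream


def fifo_steer(stream, num_clusters, fifos_per_cluster):
    """Simplified fifo steering; returns {instruction: cluster}."""
    tails = [None] * (num_clusters * fifos_per_cluster)  # last instruction per fifo
    fifo_of = {}
    for name, producer in stream:
        fifo = None
        if producer is not None and tails[fifo_of[producer]] == producer:
            fifo = fifo_of[producer]          # queue directly behind the producer
        if fifo is None:
            fifo = tails.index(None)          # lowest-numbered empty fifo (assumed)
        tails[fifo] = name
        fifo_of[name] = fifo
    return {name: fifo // fifos_per_cluster for name, fifo in fifo_of.items()}


def round_robin_steer(stream, num_clusters, block_size):
    """Blocks of consecutive instructions go to successive clusters."""
    return {name: (i // block_size) % num_clusters
            for i, (name, _) in enumerate(stream)}


def cross_cluster_edges(stream, cluster_of):
    """Count dependences whose producer and consumer sit in different clusters."""
    return sum(1 for name, producer in stream
               if producer is not None and cluster_of[producer] != cluster_of[name])


stream = make_two_chains(8)
fifo_map = fifo_steer(stream, num_clusters=4, fifos_per_cluster=2)
rr_map = round_robin_steer(stream, num_clusters=4, block_size=2)
print(cross_cluster_edges(stream, fifo_map))  # 0
print(cross_cluster_edges(stream, rr_map))    # 14
```

Under these assumptions the fifo policy keeps both chains in cluster 0, while round-robin
steering with a block size of 2 places every consumer in a different cluster from its
producer.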
3.4.3 Effect of Increasing Inter-cluster Latency
The graph in Figure 3-14 shows the effect of increasing inter-cluster latency on the
performance of 2-cluster and 4-cluster fifos.fifosteer microarchitectures. Performance
degrades as the latency of inter-cluster communication is increased. This is expected since
increasing inter-cluster communication latency increases the time taken to perform any
computation that is spread across multiple clusters and hence, could easily stretch the
critical path of the program. For 2-cluster systems, the average performance degradation
when the inter-cluster latency is increased from 1 cycle (that is, a single bubble between
two dependent instructions executing in different clusters) to 2 cycles and from 2 to 3
cycles is 8.7% and 9.3% respectively. Similarly, for 4-cluster systems, the corresponding
performance degradations are 13.4% and 11.2% respectively. The reduction in performance is
higher for the 4-cluster systems since the number of instruction dependences spread across
clusters increases with the number of clusters. This shows that it is extremely important
to provide low latency inter-cluster communication for high performance.
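The way inter-cluster latency stretches a dependence chain can be illustrated with a toy
Python model. It assumes single-cycle operations and charges a fixed number of bubble
cycles each time consecutive dependent instructions sit in different clusters. The
function name and the example cluster assignment are hypothetical, and the model ignores
everything else the simulator accounts for.

```python
def chain_finish_time(clusters, bubble_cycles, op_latency=1):
    """Cycles needed to execute one chain of dependent instructions.

    clusters[i] is the cluster that executes the i-th instruction of the chain;
    bubble_cycles is the extra delay paid when a value crosses clusters.
    """
    time = 0
    for i, cluster in enumerate(clusters):
        if i > 0 and cluster != clusters[i - 1]:
            time += bubble_cycles         # operand arrives over a slow bypass
        time += op_latency
    return time


spread_chain = [0, 0, 1, 1, 0, 0]         # two cluster crossings
local_chain = [0, 0, 0, 0, 0, 0]          # no crossings

for latency in (1, 2, 3):
    print(latency, chain_finish_time(spread_chain, latency))
# 1 8
# 2 10
# 3 12
print(chain_finish_time(local_chain, 3))  # 6
```

Each additional cycle of inter-cluster latency adds one cycle per crossing on the chain,
which is why configurations with more cross-cluster dependences are more sensitive.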
Figure 3-14. Effect of increasing inter-cluster latency.
(Plot. Y-axis: instructions per cycle, 0.0 to 4.0. X-axis: compress, gcc, m88ksim, vortex.
Series: 1-cluster.single-window, 2-cluster.fifos.fifosteer.1, 2-cluster.fifos.fifosteer.2,
2-cluster.fifos.fifosteer.3, 4-cluster.fifos.fifosteer.1, 4-cluster.fifos.fifosteer.2,
4-cluster.fifos.fifosteer.3.)
3.4.4 Inter-cluster Bypass Frequency
The graph in Figure 3-15 shows the frequency of inter-cluster communication for various
steering heuristics and 4-cluster configurations. Inter-cluster communication is measured
in terms of the fraction of total instructions that exercise inter-cluster bypasses. This
does not include cases where an instruction reads its operands from the register file in
the cluster, i.e., cases in which the operands arrive from the remote cluster in advance.
As expected, we see that there is a high correlation between the frequency of inter-cluster
communication and performance: configurations that exhibit higher inter-cluster
communication commit fewer instructions per cycle. The inter-cluster communication is
particularly high in the case of random steering, reaching as high as 35% in the case of
vortex. Execution steering exhibits the lowest inter-cluster bypass frequency. This is not
surprising because execution steering is based on the greedy policy of postponing selection
to favor execution of dependent instructions in the same cluster. Another observation that
can be made from the graph is that the fifos.fifosteer microarchitecture uniformly
exercises fewer inter-cluster bypasses than the windows.rrsteer microarchitecture. This is
in agreement with the earlier discussion about how the fifo steering policy dynamically
adapts the number of clusters being used to the parallelism in the instruction stream, thus
resulting in fewer inter-cluster bypasses.
Figure 3-15. Inter-cluster bypass frequency.
(Plot. Y-axis: inter-cluster bypass frequency (%), 0 to 40. X-axis: compress, gcc, m88ksim,
vortex. Series: 2-cluster.1window.execsteer, 2-cluster.fifos.fifosteer,
2-cluster.windows.fifosteer, 2-cluster.windows.rrsteer, 2-cluster.windows.randomsteer.)
3.4.5 Comparing against In-order Distributed Reservation Stations
Johnson [Joh91] proposed using in-order distributed reservation stations as a means of
reducing the complexity of the instruction window. Instructions are forced to issue
in-order from the reservation stations. The advantages of such a scheme are similar to
those of the fifo-based microarchitecture: simpler wakeup and selection logic. The
fifo-based microarchitecture differs from Johnson's scheme in the manner in which
instructions are steered to the fifos. The dependence-based microarchitecture steers
instructions based on dependence information extracted at run-time instead of instruction
type as in the case of the in-order reservation stations scheme.
The graph in Figure 3-16 compares the performance of 2-cluster configurations based on
in-order distributed reservation stations and on the fifo-based microarchitecture with the
fifo steering policy. The dependence-based microarchitecture consistently performs better
than in-order reservation stations. The average performance degradation of the in-order
reservation stations scheme is as high as 27%. This is mainly due to two factors. First, in
the in-order reservation stations scheme, instructions at the head of the reservation
stations can block other ready instructions behind them from issuing. Second, the
instruction distribution logic in the in-order reservation stations scheme makes no attempt
to minimize the use of inter-cluster bypasses.
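The first factor, head-of-queue blocking, can be illustrated with a toy Python comparison
of one in-order queue against one out-of-order window, each issuing at most one
instruction per cycle. The function names are hypothetical, and the model assumes that the
cycle at which each instruction becomes ready is fixed in advance, which ignores the
feedback between issue order and readiness in a real pipeline.

```python
def issue_in_order(ready_cycles):
    """Cycles to drain an in-order queue; ready_cycles[i] is the first cycle
    in which instruction i could issue. Only the head may issue."""
    cycle = 0
    for ready in ready_cycles:
        cycle = max(cycle, ready) + 1     # wait for the head, then use one slot
    return cycle


def issue_out_of_order(ready_cycles):
    """Cycles to drain a window that issues the oldest ready instruction."""
    pending = list(ready_cycles)
    cycle = 0
    while pending:
        ready_now = [r for r in pending if r <= cycle]
        if ready_now:
            pending.remove(ready_now[0])  # oldest instruction that is ready
        cycle += 1
    return cycle


# The head instruction is not ready until cycle 5; the three behind it are ready at once.
ready_cycles = [5, 0, 0, 0]
print(issue_in_order(ready_cycles))      # 9
print(issue_out_of_order(ready_cycles))  # 6
```

In the in-order case the three ready instructions wait behind the stalled head, whereas
the out-of-order window issues them while the head is still waiting.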
Figure 3-16. Comparing against in-order distributed reservation stations.
(Plot. Y-axis: instructions per cycle, 0.0 to 3.5. X-axis: compress, gcc, go, ijpeg, li,
m88ksim, perl, vortex. Series: 2-cluster in-order res. stations, 2-cluster fifo-based.)
Butler and Patt [BP92] also report significant performance degradation when the
head-only (fifo) scheduling policy is used with distributed reservation stations.
3.5 Related Work
Tomasulo, in his original proposal [Tom67] on dynamic scheduling, proposed distributed
reservation stations as an alternative to centralized reservation stations to reduce
complexity. Distributed reservation stations simplify selection logic. The selection logic
at a functional unit only has to monitor the instructions in the reservation stations
associated with that unit. However, the result tags still have to be broadcast to all the
reservation stations just as in the case of centralized reservation stations, i.e., the
complexity of window wakeup logic remains the same.
Johnson [Joh91] proposed in-order distributed reservation stations to further reduce
issue-logic complexity. The fifo-based microarchitecture presented in this chapter is
similar to the in-order distributed reservation stations scheme in a number of respects.
Both distribute window entries and force in-order issue out of the distributed window
entries to simplify selection logic. However, there are two key differences. First, the
fifo-based microarchitecture uses a prescheduling (steering) phase to determine a suitable
fifo to place each instruction in. As shown in Section 3.4.5, this intelligent steering
helps the dependence-based microarchitecture extract more parallelism relative to in-order
distributed reservation stations. Second, the dependence-based microarchitectures use
clustering to simplify wakeup logic. A cluster consists of a small number of branch, ALU,
and memory units. Window operations and bypasses within a cluster complete within a single
cycle, thus facilitating back-to-back execution of dependent instructions residing in each
cluster. Tomasulo's distributed reservation stations, on the other hand, cluster functional
units based on type. For example, all memory units are clustered together and so on. This
results in more cross-cluster traffic compared to the dependence-based microarchitectures.
An early CRAY-2 design [Unk79, SS90, Smi97] realized the importance of detecting and
exploiting dependences to facilitate a fast clock. The issue logic consisted of four
instruction queues feeding eight execution units. A dependent chain of instructions was
issued to the same queue. The compiler was responsible for grouping dependent instructions
together. A single accumulator style instruction set helped express the grouping to the
hardware without the need for extra bits to explicitly specify dependences. The hardware
simply starts a new chain whenever it hits a LDA (load accumulator) instruction in the
instruction stream. As a result, the hardware does not have to extract dependence
information at run-time. The fifo-based microarchitecture investigated in this chapter was
partly inspired by the CRAY-2 design. The primary difference is that hardware steering is
used instead of compiler steering. As explained before, hardware steering is well-suited
for integer codes since the small basic blocks and frequent control instructions in integer
codes can severely handicap compile-time steering of instructions to fifos.
Kemp and Franklin [KF96] studied a microarchitecture called PEWS (Parallel Execution
Windows) for simplifying the logic associated with a central window. PEWS simplifies
window logic by splitting the central instruction window among multiple windows much like
the dependence-based microarchitectures described in this chapter. Register values are
communicated between clusters (called pews) via hardware queues and a ring interconnection
network. In contrast, we assume a broadcast mechanism for the same purpose. Instructions
are steered to the pews based on instruction dependences with the goal of minimizing
inter-pew communication. However, for their experiments Kemp and Franklin assume that each
of the pews has as many functional units as the central window organization. This
assumption implies that the reduction in complexity achieved is limited because the wakeup
and selection logic of the windows in the individual pews still have the same porting
requirements as the central window.
The DEC 21264 [Gwe96a] is the first commercial microarchitecture implementing
out-of-order scheduling that was forced to use significant microarchitectural changes,
relative to the conventional microarchitecture, to support a fast clock. Like the
dependence-based microarchitectures explored in this chapter, the execution units are
partitioned into two clusters with bypasses between clusters taking an extra cycle to
complete. The selection
logic steers instructions buffered in a central window to the execution cluster based on
dependences. The exact steering algorithm used has not been made public.
Multiscalar processors [Bre, FS92, Fra93, SBV95] pioneered the concept of using
decentralized processor resources to reduce complexity. Multiple clusters, each similar in
structure to a narrow superscalar, are used to execute different portions of the serial
program. The different portions of the program are called tasks and can be identified
either by the compiler or by the hardware. The design is highly decentralized. All major
structures in the pipeline, starting from the fetch hardware, are distributed. In addition,
the paradigm naturally supports advanced features like multiple flows of control and
out-of-order fetch. These features are considered essential for exploiting higher levels of
parallelism [LW92, Smi95] in the future. While the Multiscalar design is a futuristic
microarchitecture designed with complexity-effectiveness in mind, it will take some time
for the design to evolve and for its implementation to become feasible. The
dependence-based superscalar microarchitectures explored in this chapter provide a smooth
transition path, from the point of view of implementation, to Multiscalar-like designs from
current superscalar designs.
More recently, processor microarchitectures called Trace processors [VM97, RJSS97]
have been proposed that organize the microarchitecture around traces. Just like in the
Multiscalar and dependence-based microarchitectures, execution resources are partitioned
into clusters. Each cluster is assigned a dynamic instruction trace for execution that is
fetched from a cache of traces called the trace cache. The trace cache, in addition to
providing a high-bandwidth fetch mechanism, also simplifies rename logic by caching rename
information along with the trace. The trace processor microarchitecture can be viewed as a
dependence-based microarchitecture that has completely flexible windows in each cluster
and steers instructions to clusters using a round-robin policy.
Farkas et al. [FCJV97] propose the multicluster microarchitecture to reduce the clock
cycle time of typical superscalar microarchitectures. The multicluster microarchitecture is
similar in concept to the dependence-based microarchitectures explored here. There are
two primary differences, however. First, the multicluster architecture uses compiler
steering instead of hardware steering. Second, it uses explicit copy instructions to
communicate operand values between the clusters. Steering information is passed to the
hardware indirectly without changing the instruction set architecture. Each cluster is
assigned a subset of the architectural registers and instructions are steered based on the
registers specified in the instruction. A static scheduling heuristic chooses a cluster so
that the load imbalance between the two clusters (Farkas et al. only study 2-cluster
systems) is minimized. Farkas et al. found that even this heuristic cannot be directly
addressed by the compiler because the work done by a cluster is a function of the order in
which instructions are issued, and the issue order is not deterministic for
dynamically-scheduled processors.
3.6 Chapter Summary
This chapter presented the design and evaluation of a family of complexity-effective
microarchitectures called dependence-based superscalar microarchitectures. These
microarchitectures facilitate a fast clock while exploiting levels of parallelism similar
to those of an ideal large-window machine. The proposed microarchitectures use a
two-pronged strategy for high performance. First, the issue window and execution resources
are partitioned to facilitate a fast clock. Second, instructions are intelligently steered,
taking into account dependences, to the different partitions in order to extract levels of
parallelism similar to those of an ideal large-window machine.
One of the dependence-based microarchitectures, called the fifo-based microarchitecture,
detects chains of dependent instructions and steers the chains to fifos which are
constrained to execute in-order. Since only the instructions at the fifo heads have to be
monitored for execution, the proposed microarchitecture simplifies window logic.
Furthermore, the microarchitecture naturally lends itself to clustering by grouping
dependent instructions together. This grouping of dependent instructions helps mitigate the
bypass problem to a large extent by using fast local bypasses more frequently than slow
inter-cluster bypasses. The performance of a 2 × 4-way fifo-based microarchitecture is
compared with a typical 8-way superscalar. The results show two things. First, the proposed
microarchitecture has IPC performance close to that of a typical microarchitecture (average
degradation in IPC performance is 7.8%). Second, when taking the clock speed advantage of
the fifo-based microarchitecture into account, the 8-way proposed microarchitecture is 14%
faster than the typical window-based microarchitecture on average.
Overall, the experimental results presented show that dependence-based superscalar
microarchitectures are capable of extracting similar levels of parallelism as typical
microarchitectures while enabling a faster clock.
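The two summary figures can be related with a back-of-envelope calculation, assuming that
overall performance is proportional to IPC multiplied by clock frequency. The implied
clock ratio below is derived under that assumption only and is not a number reported in
this chapter. Because the 7.8% and 14% figures are averages over benchmarks, they need not
compose exactly.

```python
# Figures quoted in the summary above.
ipc_degradation = 0.078      # average IPC loss of the fifo-based design
overall_speedup = 1.14       # fifo-based design is 14% faster on average

# Assumption: performance = IPC x clock frequency.
ipc_ratio = 1.0 - ipc_degradation
implied_clock_ratio = overall_speedup / ipc_ratio

print(round(ipc_ratio, 3))            # 0.922
print(round(implied_clock_ratio, 3))  # 1.236
```

Under this simple model, a clock roughly 24% faster would be enough to turn a 7.8% IPC
loss into a 14% net gain.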
Chapter 4
Integer-Decoupled Microarchitecture
The integer-decoupled microarchitecture is a complexity-effective microarchitecture
that can improve the performance of integer programs with little or no increase in com-
plexity. It is particularly attractive since it can be implemented on top of current microar-
chitectures with relatively small hardware changes. This chapter proposes and evaluates
the integer-decoupled microarchitecture.
Integer-decoupled microarchitectures execute some of the integer instructions, those not
involved in computing addresses and accessing memory, on idle floating-point resources
that have been augmented to perform simple integer operations. The compiler identifies
computation to off-load to the floating-point subsystem. This results in a number of
benefits for integer programs including extra issue width, a bigger effective window, and
decoupling of memory access from the actual computation.
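The split between address-related work and the remaining computation can be sketched as a
backward slice over a dependence graph. The Python sketch below is an illustrative
assumption about how such a partition might be computed, not the partitioning scheme used
by the compiler in this thesis: the tuple format, the function name, and the rule that
everything outside the address slice is off-loaded are all hypothetical simplifications.

```python
def partition(program):
    """Assign each instruction to the load/store side or the computation side.

    program: list of (name, kind, addr_srcs, value_srcs), where kind is one of
    "load", "store", "alu", "branch" and the source lists name producer
    instructions. Returns {name: "LdSt" or "Comp"}.
    """
    info = {name: (kind, addr, val) for name, kind, addr, val in program}
    ldst = set()
    worklist = []
    for name, kind, addr, _ in program:
        if kind in ("load", "store"):
            ldst.add(name)            # memory instructions stay on the load/store side
            worklist.extend(addr)     # seed the slice with their address producers
    while worklist:
        name = worklist.pop()
        if name in ldst:
            continue
        ldst.add(name)                # part of an address computation
        _, addr, val = info[name]
        worklist.extend(addr + val)   # follow the slice backwards
    return {name: ("LdSt" if name in ldst else "Comp") for name, *_ in program}


example = [
    ("i1", "alu",    [],     []),       # p = base + offset (address arithmetic)
    ("i2", "load",   ["i1"], []),       # x = mem[p]
    ("i3", "alu",    [],     ["i2"]),   # y = x + 1
    ("i4", "branch", [],     ["i3"]),   # branch on y
    ("i5", "store",  ["i1"], ["i3"]),   # mem[p] = y
]
assignment = partition(example)
print(assignment)
# {'i1': 'LdSt', 'i2': 'LdSt', 'i3': 'Comp', 'i4': 'Comp', 'i5': 'LdSt'}
```

In this example the address arithmetic and the memory operations stay on the load/store
side, while the increment and the branch that consume the loaded value are candidates for
the augmented floating-point resources.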
Another way to look at the integer-decoupled microarchitecture, in the context of the
dependence-based microarchitectures presented in the previous chapter, is that the existing
floating-point subsystem provides an extra cluster, for free, that can be used for
executing integer instructions. However, unlike the dependence-based microarchitectures,
instruction steering in this case is performed by the compiler.
The rest of the chapter is organized as follows. Section 4.1 presents the concept behind
the integer-decoupled microarchitecture. Section 4.2 discusses the hardware additions that
have to be made to the conventional microarchitecture. Section 4.3 illustrates, with an
example, the kind of computation that is off-loaded to the augmented FP subsystem.
Section 4.4 discusses the role of the compiler and the basic partitioning scheme used by
the compiler. Section 4.5 shows how the basic partitioning scheme can be improved using
copy instructions and code duplication. Section 4.6 presents the results of an experimental
evaluation of the proposed microarchitecture. Finally, the chapter is summarized in
Section 4.8.
4.1 Concept
To motivate the proposed microarchitecture, consider how the conventional
microarchitecture illustrated in Figure 1-1 on page 2 works. The instruction fetch unit
reads multiple instructions from the instruction cache and feeds them to integer and
floating-point subsystems for execution. The integer subsystem contains a number of
load/store, branch, and functional units that operate on integer operands. The
floating-point subsystem is similar to the integer subsystem except it does not contain
load/store units, and it operates on floating-point operands. Instruction windows, in the
form of buffers, are used to decouple the instruction fetch unit from the integer and
floating-point execution subsystems.
Partitioning issue and execution resources into integer and floating-point subsystems has
several advantages. First, as shown in Chapter 2, it eliminates the cycle time penalties
associated with centralized structures. For example, registers are divided into integer and
floating-point files, each with a set of ports. The instruction window is similarly
divided, with separate issue logic. Second, while executing floating-point programs, the
microarchitecture naturally decouples addressing and floating-point computation: address
computation executes in the integer subsystem while floating-point computation executes
in the FP subsystem so that dynamic scheduling between the two can be enhanced. Third,
since integer data and floating-point data typically have different widths (32-bit versus
64-bit), using separate integer and floating-point subsystems helps reduce implementation
complexity and save silicon area. The last benefit will be nullified by the move towards
64-bit instruction set architectures in which both integer and floating-point data are 64
bits wide. The uniform use of 64-bit data in both integer and floating-point subsystems
enables the optimization being proposed here.
This microarchitecture style leads to idle floating-point resources (registers, functional
units, instruction window logic, and buses) while executing integer programs or
integer-intensive portions of floating-point programs. To address this drawback, we propose
a more general decoupled microarchitecture style based on earlier work
[BRT93, GHL+85, PD83, Smi82, S+87], in which the floating-point subsystem executes
both integer and floating-point operations. In this microarchitecture, which we refer to as
the integer-decoupled microarchitecture, a load/store (LdSt) subsystem mostly executes
integer instructions involved in effective address calculation and memory access, and a
computation (Comp) subsystem supports all floating-point operations as well as non-memory
related integer computation. The integer-decoupled microarchitecture can be built on top of
the conventional microarchitecture with relatively few hardware additions. These hardware
changes are discussed in the next section.
The integer-decoupled microarchitecture has a number of performance advantages over
a conventional microarchitecture for integer programs. First, it provides extra issue and
execution bandwidth for integer programs. For example, by implementing the
integer-decoupled microarchitecture, a superscalar processor with 2 integer and 2
floating-point functional units can provide an issue and execution width of 4 for most
integer codes. Second, by using the instruction window in the floating-point subsystem, the
integer-decoupled microarchitecture provides a larger overall window. This can potentially
increase the amount of parallelism exploited. Third, the compiler now has 64 logical
registers (32 int and 32 fp) for holding integer variables instead of the usual 32.
Finally, the integer-decoupled microarchitecture often facilitates early resolution of
mispredicted branches. If the branch computation associated with a mispredicted branch
executes in the less heavily loaded Comp subsystem, then it is very likely that the branch
will be resolved earlier relative to the conventional microarchitecture.
The integer-decoupled concept can also be used to reduce the complexity of a conventional
superscalar microarchitecture. By steering integer instructions to the augmented
floating-point subsystem, the integer-decoupled microarchitecture does not require as
many issue window entries in the integer subsystem as the conventional microarchitecture.
Similarly, it can be used to reduce the size of the physical register file in the integer
subsystem. Ideally, the complexity of an n-wide conventional microarchitecture can be
reduced by implementing it as an integer-decoupled microarchitecture with the LdSt and
Comp subsystems each being n/2-wide. This advantage of the integer-decoupled
microarchitecture is not quantified here.
4.2 Changes to the Conventional Microarchitecture
The integer-decoupled microarchitecture remains very similar to a conventional
microarchitecture. The only hardware modification required is augmenting the existing
floating-point functional units to perform simple integer operations. There need be no
additional cost for registers and buses if the integer operations are embedded in the
existing floating-point functional units and share the existing register file ports and
buses. Similarly, instruction fetch and issue resources are unchanged. The only extra costs
are the additional gates required to implement the simple integer operations and the
opcodes for specifying these operations. Results presented later show that the
gate-intensive integer multiply and divide operations need not be duplicated and hence, the
extra cost should not be a factor.
The instruction set architecture (ISA) has to be minimally augmented to include the simple
integer operations that operate on the floating-point registers. The changes required are
similar in spirit to the recent multimedia extensions introduced by most microprocessor
vendors [Gwe95c, Gwe96b]. The integer opcodes of the SimpleScalar [BAB96] ISA that
are supported in the Comp subsystem are shown in Table 4.1. Because the floating-point
opcode space is usually relatively sparse compared to the integer opcode space, and about
21 extra opcodes are required, the necessary ISA extensions are realistic.
4.3 Partitioning the Program
Given the constraints of the integer-decoupled microarchitecture, let us look at the kind
of integer computation that can be off-loaded to the Comp subsystem and the role of the
compiler in identifying such computation. Because we want to decouple address computa-
tion from the rest of the program computation, all load/store instructions and integer
instructions involved in effective address computation are assigned to the LdSt subsystem.
All other sequences of instructions terminate either in the computation of branch out-
comes or store values. The instruction sequences, called branch computation and store-
value computation, are ideal candidates for execution in the Comp subsystem because they
do not require any special support in the Comp subsystem. The result of a branch
computation, the branch outcome, is sent to the fetch unit where it is used to validate the
predicted outcome. This functionality is present in existing floating-point subsystems for
floating-point branches. The result of a store-value computation, the value being stored, is
deposited in the write buffer where it merges with the corresponding store address
generated by the LdSt subsystem. This mechanism is also implemented in current
floating-point subsystems to store floating-point values. However, some store-value and
branch computations might not be assigned to the Comp subsystem if the instructions in
these computations are also involved in address computation. The example to be presented
next illustrates this.

    Operation type   Opcodes
    Control          bgez bgtz blez bltz bne
    Logical          andi nor ori xori sllv sll srav sra srlv srl
    Arithmetic       addi addiu addu lui slti sltiu

Table 4.1: Extra opcodes supported in the Comp subsystem.
Figure 4-1 shows a program fragment in C from invalidate_for_call, a frequently executed
function in the SPEC benchmark gcc. The for-loop in the program runs through all
the pseudo registers and does some bookkeeping for those that are invalidated by function
calls. The figure shows assembly code compiled for a conventional microarchitecture. The
whole program executes in the integer subsystem leaving the floating-point subsystem
completely idle.
    extern unsigned long regs_inv_by_call;
    for (regno = 0; regno < FIRST_PSEUDO_REG; regno++)
        if (regs_inv_by_call & (1 << regno)) {
            delete_equiv_reg(regno);
            if (reg_tick[regno] >= 0)
                reg_tick[regno]++;
        }

    I1:   move $16, $0                /* regno = 0 */
    $L5:
    I2:   lw   $2, regs_inv_by_call
    I3:   sra  $2, $2, $16
    I4:   andi $2, $2, 0x1            /* $2=regs_inv_by_call & (1<<regno) */
    I5:   beq  $2, $0, $L4
    I6:   move $4, $16
    I7:   jal  delete_equiv_reg
    I8:   lw   $3, reg_tick
    I9:   sll  $2, $16, 2
    I10:  addu $2, $2, $3
    I11:  lw   $4, 0($2)              /* $4 = reg_tick[regno] */
    I12:  bltz $4, $L4
    I13:  addu $4, $4, 1              /* reg_tick[regno]++ */
    I14:  sw   $4, 0($2)
    $L4:
    I15:  addu $16, $16, 1            /* regno++ */
    I16:  slt  $2, $16, 66            /* regno < FIRST_PSEUDO_REG */
    I17:  bne  $2, $0, $L5

Figure 4-1. An example program fragment.

With very little effort, the assembly code shown in Figure 4-1 can be transformed to
off-load some of the integer computation to the Comp subsystem as shown on the left in
Figure 4-2. Integer instructions that execute in Comp are shown in bold with a ,c suffix.
The load instruction, I11, instead of loading into integer register $4, now loads the value
into floating-point register $f0. Instructions I12 and I13 operate on the loaded value in
floating-point register $f0 and execute in the Comp subsystem. The result of the branch
instruction (I12) is sent from the Comp subsystem to the fetch unit to validate the
prediction made. The result of the add instruction (I13) is sent to the store buffer where
it is merged with the address generated by the store instruction (I14) executed in the LdSt
subsystem. The load and store instructions (I11 and I14) are italicized to point out that
these instructions now load and store floating-point registers. These are the same as
floating-point load and store instructions in the conventional microarchitecture. Relating
the example to the discussion earlier, the branch computation and store-value computation
that are off-loaded in this case are the singleton sets {I12} and {I13} respectively. The
branch computation {I15, I16, and I17} was not assigned to the Comp subsystem because
instruction I15 is also involved in generating the address for the load instruction I11.
Basic partitioning scheme (left):

    I1:   move    $16, $0
    $L5:
    I2:   lw      $2, regs_inv_by_call
    I3:   sra     $2, $2, $16
    I4:   andi    $2, $2, 0x1
    I5:   beq     $2, $0, $L4
    I6:   move    $4, $16
    I7:   jal     delete_equiv_reg
    I8:   lw      $3, reg_tick
    I9:   sll     $2, $16, 2
    I10:  addu    $2, $2, $3
    I11:  lw      $f0, 0($2)
    I12:  bltz,c  $f0, $L4
    I13:  addu,c  $f0, $f0, 1
    I14:  sw      $f0, 0($2)
    $L4:
    I15:  addu    $16, $16, 1
    I16:  slt     $2, $16, 66
    I17:  bne     $2, $0, $L5

Advanced partitioning scheme (right):

    I1:   move    $16, $0
    I1':  cp_comp $f2, $16
    $L5:
    I2:   lw      $f4, regs_inv_by_call
    I3:   sra,c   $f4, $f4, $f2
    I4:   andi,c  $f4, $f4, 0x1
    I5:   beq,c   $f4, $0, $L4
    I6:   move    $4, $16
    I7:   jal     delete_equiv_reg
    I8:   lw      $3, reg_tick
    I9:   sll     $2, $16, 2
    I10:  addu    $2, $2, $3
    I11:  lw      $f0, 0($2)
    I12:  bltz,c  $f0, $L4
    I13:  addu,c  $f0, $f0, 1
    I14:  sw      $f0, 0($2)
    $L4:
    I15:  addu    $16, $16, 1
    I15': addu,c  $f2, $f2, 1
    I16:  slt,c   $f4, $f2, 66
    I17:  bne,c   $f4, $0, $L5

Figure 4-2. Code partitioning for example fragment.
In the transformation just presented, computation was off-loaded to the Comp subsystem
without introducing new instructions in the program. However, by strategically inserting
copy instructions and duplicating some instructions, additional computation can be
off-loaded to the Comp subsystem. For example, consider the transformation presented on the
right in Figure 4-2. The copy instruction (I1') and the duplicate instruction (I15') help
off-load a sizable fraction of the total computation to the Comp subsystem. Now, as many as
seven static instructions of the original program execute in the Comp subsystem.
The compiler for the integer-decoupled microarchitecture is responsible for effecting the
transformations presented above. More abstractly, the compiler is responsible for parti-
tioning the original program into LdSt and Comp partitions. The transformation on the left
in Figure 4-2 is a result of the basic partitioning scheme used by the compiler. In this
scheme, no new instructions are introduced and communication between the two sub-
systems happens via loads and stores that already exist in the original program.
Section 4.4 discusses the basic scheme in detail. The second transformation is a result of
the advanced partitioning scheme used by the compiler. In this scheme, the compiler intel-
ligently introduces a few extra instructions in the form of copy or duplicate instructions to
enable off-loading of more computation to the Comp subsystem. Section 4.5 discusses the
advanced partitioning scheme.
4.4 Basic Partitioning Scheme
As mentioned earlier, the basic partitioning scheme off-loads computation to the Comp
subsystem without introducing new instructions. In this section, some terminology is
presented first to aid subsequent discussion. Then, the necessary conditions that need to be
satisfied for branch and store-value computation to be assigned to the Comp subsystem are
described. Finally, the partitioning algorithm used by the compiler is presented.
4.4.1 Terminology and Data Structures
A slice [Wei84] of a program P with respect to a value v is defined to be the subset of P
that is involved in the computation of v. We term this the backward slice of P with respect
to v and represent it as Backward-Slice(P,v). The forward slice of P with respect to v is all
computation that is affected by v, and is represented as Forward-Slice(P,v). An example is
shown in Figure 4-3.
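The two slice definitions can be sketched as reachability over def-use edges. The snippet below is a minimal illustration, not the compiler's implementation: it assumes straight-line code, encodes the three-statement program of Figure 4-3 by hand, and uses hypothetical helper names (def_use_edges, reachable, slices). Following the figure, the statement that defines v is itself excluded from both slices.

```python
from collections import defaultdict

# Program P from Figure 4-3 as (defined variable, variables used).
# This encoding is an assumption made for the sketch.
P = [
    ("a", ["b", "c"]),  # statement 0: a = b + c
    ("d", ["a", "g"]),  # statement 1: d = a * g
    ("f", ["d"]),       # statement 2: f = d + 2
]

def def_use_edges(program):
    """Edge i -> j when statement j uses the value defined by earlier statement i."""
    last_def = {}
    edges = defaultdict(set)
    for j, (dest, uses) in enumerate(program):
        for u in uses:
            if u in last_def:
                edges[last_def[u]].add(j)
        last_def[dest] = j
    return edges

def reverse(edges):
    """Flip every edge so that reachability walks from consumers to producers."""
    rev = defaultdict(set)
    for i, targets in edges.items():
        for j in targets:
            rev[j].add(i)
    return rev

def reachable(edges, start):
    """All statements reachable from start, excluding start itself."""
    seen, stack = set(), list(edges.get(start, ()))
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(edges.get(n, ()))
    return seen

def slices(program, var):
    """Return (backward slice, forward slice) of var as sorted statement indices."""
    edges = def_use_edges(program)
    d = max(i for i, (dest, _) in enumerate(program) if dest == var)
    return sorted(reachable(reverse(edges), d)), sorted(reachable(edges, d))

backward_f, _ = slices(P, "f")   # statements 0 and 1, as in Backward-Slice(P,f)
_, forward_a = slices(P, "a")    # statements 1 and 2, as in Forward-Slice(P,a)
```

The same two traversals, applied to the static dependence graph instead of a statement list, are what the partitioning conditions of Section 4.4.2 are phrased in terms of.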
To partition a program, the compiler uses a data structure called the static dependence
graph that compactly represents all the register dependences in a program. The static
dependence graph (SDG) is a directed graph which has a node corresponding to each
static instruction in the program. The SDG has an edge from node v_i to node v_j if
instruction i produces a register value that could be consumed by instruction j. Load and
store instructions are special cased in the SDG to simplify the partitioning algorithm. Each
load instruction is split into two nodes - one representing the load address and the other
representing the loaded value. Similarly, each store instruction is split into two nodes -
one representing the store address and the other representing the store value. This is done
because a load instruction executes in the LdSt subsystem, but the value can be loaded into
either subsystem. Likewise, the value being stored can come from either the LdSt subsystem
or the Comp subsystem.
    Program P          Backward-Slice(P,f)     Forward-Slice(P,a)
    a = b + c;         a = b + c;              d = a * g;
    d = a * g;         d = a * g;              f = d + 2;
    f = d + 2;

Figure 4-3. Program slices.

Figure 4-4 shows the SDG for the program fragment in Figure 4-1. Nodes 2, 8, and 11
correspond to load instructions and have been split. To show that both nodes correspond to
a single program instruction, the split nodes have been enclosed in a bigger oval node.
Similarly, node 14 corresponds to a store instruction and has been split. The edges
correspond to register dependences. For example, instruction I3 produces $2 that is used by
instruction I4 and hence, there is an edge between I3 and I4.
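The node splitting described above can be sketched as follows. This is an illustrative simplification, not the compiler's data structure: it handles straight-line code only, approximates "could be consumed by" with the most recent writer of each register, and the tuple encoding, the build_sdg name, and the "*" suffix for value nodes (borrowed from the labels in Figure 4-4) are assumptions of the sketch. The fragment is instructions I8 to I14 of Figure 4-1.

```python
def build_sdg(instrs):
    """Return SDG edges (producer, consumer) for straight-line code.

    instrs: list of (id, kind, dest, srcs, addr_srcs), where kind is one of
    "alu", "branch", "load", "store". A load or store with id N yields an
    address node "N" and a value node "N*".
    """
    last_writer = {}  # register -> node that most recently wrote it
    edges = []

    def use(reg, consumer):
        # Registers written outside the fragment produce no edge here.
        if reg in last_writer:
            edges.append((last_writer[reg], consumer))

    for ident, kind, dest, srcs, addr_srcs in instrs:
        if kind == "load":
            for r in addr_srcs:
                use(r, ident)                 # address node consumes address registers
            last_writer[dest] = ident + "*"   # value node produces the loaded value
        elif kind == "store":
            for r in addr_srcs:
                use(r, ident)                 # address node
            for r in srcs:
                use(r, ident + "*")           # value node consumes the stored value
        else:
            for r in srcs:
                use(r, ident)
            if dest is not None:
                last_writer[dest] = ident
    return edges

FRAGMENT = [
    ("8",  "load",   "$3", [],           []),       # lw   $3, reg_tick
    ("9",  "alu",    "$2", ["$16"],      []),       # sll  $2, $16, 2
    ("10", "alu",    "$2", ["$2", "$3"], []),       # addu $2, $2, $3
    ("11", "load",   "$4", [],           ["$2"]),   # lw   $4, 0($2)
    ("12", "branch", None, ["$4"],       []),       # bltz $4, $L4
    ("13", "alu",    "$4", ["$4"],       []),       # addu $4, $4, 1
    ("14", "store",  None, ["$4"],       ["$2"]),   # sw   $4, 0($2)
]

sdg_edges = build_sdg(FRAGMENT)
```

Because the address node 11 and the value node 11* are distinct and have no edge between them, the loaded value and its consumers (12, 13, 14*) are not tied to the address computation (9, 10, 11, 14), which is what later allows them to be placed in different partitions.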
4.4.2 Partitioning Conditions
Given a program P, let

    G = SDG for P
    LS(G) = Set of load/store address nodes in G
    C(G) = Comp partition of G
    L(G) = LdSt partition of G

Any partition of G into L(G) and C(G) must satisfy two conditions. First, L(G) and C(G)
must be disjoint. Second, a node v ∈ C(G) should satisfy the following conditions:
1. Backward-Slice(G,v) ∩ L(G) = ∅. For a node v ∈ C(G), this condition specifies that v
or any of its ancestors should not receive any value from L(G).
2. Forward-Slice(G,v) ∩ L(G) = ∅. For a node v ∈ C(G), this condition specifies that v
or any of its descendants should not supply any value to L(G).

Figure 4-4. Static dependence graph for example program. [SDG drawn beside the assembly
code of Figure 4-1; split load nodes 2/2*, 8/8*, 11/11* and split store node 14/14*;
legend: Loads, Store, LdSt inst, Comp inst.]
Clearly, nodes in LS(G) must be in L(G) because only the LdSt subsystem can execute
loads and stores. Instructions in the backward slices of these address nodes are involved in
addressing. The union of these backward slices is termed the LdSt slice. It follows from
the backward slice condition that the LdSt slice must also be assigned to L(G).
For our example program repeated in Figure 4-4,

    LS(G) = {2, 8, 11, 14},
    C(G) = {11*, 12, 13, 14*}, and
    L(G) = G - C(G) = {1, 2, 2*, 3, 4, 5, 6, 7, 8, 8*, 9, 10, 11, 14, 15, 16, 17}

It can be easily verified that all nodes in C(G) satisfy the backward and forward slicing
conditions. The branch computation {16, 17} could not be assigned to the Comp subsystem
because node 16 is supplied a value by node 15 which is in the LdSt slice and
hence in L(G). If this branch computation were assigned to Comp, then the backward slice
condition would be violated for nodes 16 and 17.
4.4.3 Partitioning Algorithm
The goal of the partitioning algorithm is to find the largest set C(G) that satisfies the
partitioning conditions presented previously. A simple and fast algorithm for identifying
the largest set C(G) is based on the observation that the partitioning conditions specified
previously can be restated as reachability conditions on the undirected graph G_u
corresponding to G.
Let G_u be the undirected graph corresponding to G, i.e. G_u consists of the same vertices
and edges as G, but the edges are undirected. Then, the slicing conditions can be
interpreted as: if v ∈ C(G), then v is not reachable from any node in L(G_u). So, every
connected component in G_u either belongs to L(G_u) or C(G_u) but is not shared between the
two partitions. Thus, if a connected component contains a load or a store address node,
then the connected component must be assigned to the LdSt partition because the
load/store instruction is assigned to LdSt. Conversely, if a connected component contains a
branch or store value and does not contain any load/store address node, then the connected
component is assigned to the Comp partition.
The graph in Figure 4-4 has four connected components. One component consists of
nodes {11*, 12, 13, 14*}. Since this component does not contain any load/store address
nodes, it can be assigned to the Comp subsystem. In contrast, all the other components
contain load/store address nodes and hence are assigned to the LdSt subsystem.
The complexity of the algorithm based on reachability is O(|V| + |E|) where |V| is the
number of nodes in the SDG and |E| is the number of edges in the SDG. This directly fol-
lows from the result that the connected components of an undirected graph can be com-
puted in O(|V| + |E|) time [CLL92].
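The reachability formulation can be sketched in a few lines. The snippet below is an illustration under stated assumptions rather than the compiler's code: the edge list is derived by hand from the register dependences visible in the assembly of Figure 4-1 (the edges of Figure 4-4 itself are not reproduced in the text), the partition function name is hypothetical, and, as a simplification, any component without a load/store address node is placed in Comp without checking that it ends in a branch or store value.

```python
from collections import defaultdict

# Nodes of the SDG for Figure 4-1; "N*" is the value node of load/store N.
NODES = ["1", "2", "2*", "3", "4", "5", "6", "7", "8", "8*", "9", "10",
         "11", "11*", "12", "13", "14", "14*", "15", "16", "17"]

# Register dependences read off the assembly by hand (an assumption of this sketch).
EDGES = [
    ("1", "3"), ("1", "6"), ("1", "9"), ("1", "15"),      # $16 from I1
    ("15", "3"), ("15", "6"), ("15", "9"), ("15", "15"),  # $16 from I15 (loop-carried)
    ("15", "16"), ("16", "17"),
    ("2*", "3"), ("3", "4"), ("4", "5"), ("6", "7"),
    ("8*", "10"), ("9", "10"), ("10", "11"), ("10", "14"),
    ("11*", "12"), ("11*", "13"), ("13", "14*"),
]

LS = {"2", "8", "11", "14"}  # load/store address nodes, LS(G)

def partition(nodes, edges, ls_nodes):
    """Return (C(G), L(G), number of connected components of G_u)."""
    adj = defaultdict(set)
    for u, v in edges:          # treat every edge as undirected
        adj[u].add(v)
        adj[v].add(u)
    seen, comp_part, n_components = set(), set(), 0
    for start in nodes:
        if start in seen:
            continue
        n_components += 1
        component, stack = set(), [start]
        while stack:            # depth-first walk of one connected component
            n = stack.pop()
            if n in component:
                continue
            component.add(n)
            stack.extend(adj[n] - component)
        seen |= component
        if not (component & ls_nodes):
            comp_part |= component   # no address node: assign to Comp
    return comp_part, set(nodes) - comp_part, n_components

comp, ldst, n_components = partition(NODES, EDGES, LS)
```

With these hand-derived edges the sketch finds four connected components and returns C(G) = {11*, 12, 13, 14*}, in agreement with the example above. Each node and edge is visited a constant number of times, which is where the O(|V| + |E|) bound comes from.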
4.5 Advanced Partitioning Schemes
This section discusses advanced partitioning techniques that relax the restrictions on
inserting extra instructions in order to find more computation to off-load to the Comp
subsystem. The restrictions are relaxed in two ways. First, the advanced schemes assume the
availability of copy instructions that can copy values between the LdSt and Comp register
files without accessing memory. Such instructions are present in a number of ISAs (e.g.
MIPS [KH92] and Alpha [Dig96]). Second, the advanced scheme duplicates some instructions
to arrive at better partitions. Copy and duplicate instructions can not only increase
the size of the Comp partition, but can also increase the total number of dynamic instruc-
tions executed and instruction cache miss rates. Hence, care must be taken to minimize the
overheads associated with copy and duplicate instructions. Our heuristics take into
account these overheads. It is shown in Section 4.6 that our heuristics introduce very few
extra instructions.
4.5.1 Limitations of the Basic Partitioning Scheme
The need for advanced partitioning schemes is first motivated by presenting specific
examples where the basic partitioning algorithm is limited in its ability to move
computation to the Comp subsystem.
Function calls limit the ability of the basic partitioning algorithm in finding Comp
computation in the called function and near the call site because calling conventions
require all the integer-value arguments to be passed in integer registers and the return
value to be returned in an integer register. Since the basic scheme is constrained not to
introduce extra (copy) instructions, all instructions at the call site that compute argument
values, and all instructions inside the function that use argument values are assigned to
the LdSt subsystem. The same holds for instructions that compute function return values and
instructions that use function return values. One solution to this problem is to use copy
instructions. One could let the algorithm partition code ignoring the restrictions imposed
by the calling conventions and later, when necessary, introduce copies to adhere to the
conventions.

    I1:   move    $16, $0
    I1':  cp_comp $16, $f2
    $L5:
    I2:   lw      $f4, regs_inv_by_call
    I3:   sra,c   $f4, $f4, $f2
    I4:   andi,c  $f4, $f4, 0x1
    I5:   beq,c   $f4, $0, $L4
    I6:   move    $4, $16
    I7:   jal     delete_equiv_reg
    I8:   lw      $3, reg_tick
    I9:   sll     $2, $16, 2
    I10:  addu    $2, $2, $3
    I11:  lw      $f0, 0($2)
    I12:  bltz,c  $f0, $L4
    I13:  addu,c  $f0, $f0, 1
    I14:  sw      $f0, 0($2)
    $L4:
    I15:  addu    $16, $16, 1
    I15': cp_comp $16, $f2
    I16:  slt,c   $f4, $f2, 66
    I17:  bne,c   $f4, $0, $L5

Figure 4-5. Partitioning with copies. [The SDG of Figure 4-4 redrawn with copy nodes 1'
and 15'; legend: Loads, Store, LdSt inst, Comp inst, Copy inst.]
If any branch or store-value computation in the program is supplied a value by any
addressing instruction, then the basic partitioning scheme assigns that computation to the
LdSt subsystem. Figure 4-4 shows the SDG and the partitioning generated by the basic
partitioning scheme for our running example. In the example, the branch computations
{I16, I17} and {I2, I3, I4, I5} are supplied by the addressing instructions I1 and I15 and
hence could not be assigned to Comp. By inserting copies for the results of I1 and I15,
these branch computations can execute in Comp. Figure 4-5 shows the code generated and
the associated SDG when this is done. In this example, copies have enabled the off-loading
of five more instructions to Comp. Since I1' is outside the loop, the copy overhead is
incurred repeatedly only for I15'.
For this example, code duplication can be used to achieve the same partitioning as that
realized by inserting copies. In the C code fragment shown in Figure 4-1, the loop induction
variable regno is used both for address computation and for branch computation.
By duplicating the induction variable regno in Comp, the two pieces of code can proceed
independently without any communication. Figure 4-6 shows the assembly code and the
associated SDG when this is done. I1' and I15' are the duplicate instructions. Since I1' is
outside the loop, duplication overheads are repeatedly incurred only for node I15'.
Thus, copy instructions and code duplication can achieve better code partitioning. However,
arbitrary use of these techniques can hurt performance because copies and duplicates
may introduce overhead. The advanced partitioning algorithm used by the compiler
employs a cost model to identify profitable sites for copy insertion and code duplication.
The cost model and the algorithm are briefly described here. Subramanya Sastry was a
major contributor in designing the cost model and the advanced partitioning algorithm.
They are discussed in detail by Sastry et al. [SPS98].
4.5.2 Cost Model
Intuitively, the benefit from a copy instruction or a duplicated instruction is the number
of extra dynamic instructions that will execute in the Comp subsystem as a result of the
copy/duplicate inserted. Symbolically, given an SDG G,

    Let S_copy be the set of nodes in G for which copies are inserted.
    Let S_dupl be the set of nodes in G which are duplicated.
    Let S_c be the set of nodes in G that can be moved from LdSt to Comp as a result of
    the copies and duplicates.

The nodes in S_c execute in Comp, yielding a bigger Comp partition. However, execution
of nodes in S_copy and S_dupl introduces overhead in the program. It is beneficial to
introduce these copies and duplicates only if the increase in size of the Comp partition
offsets the overhead. This is quantified by the following equations.

\[ \mathit{Benefit} = \sum_{v \in S_{c}} n_{B(v)} \]
\[ \mathit{Overhead} = o_{copy} \sum_{v \in S_{copy}} n_{B(v)} + o_{dupl} \sum_{v \in S_{dupl}} n_{B(v)} \]
\[ \mathit{Profit} = \mathit{Benefit} - \mathit{Overhead} \]

where:

    B(I): Basic block containing instruction I
    n_B: Number of times basic block B executed at run-time
    o_copy: Overhead of a copy instruction
    o_dupl: Overhead of a duplicate instruction

Hence, it is beneficial to introduce copies and duplicate instructions only if Profit > 0.

    I1:   move    $16, $0
    I1':  move,c  $f2, $0
    $L5:
    I2:   lw      $f4, regs_inv_by_call
    I3:   sra,c   $f4, $f4, $f2
    I4:   andi,c  $f4, $f4, 0x1
    I5:   beq,c   $f4, $0, $L4
    I6:   move    $4, $16
    I7:   jal     delete_equiv_reg
    I8:   lw      $3, reg_tick
    I9:   sll     $2, $16, 2
    I10:  addu    $2, $2, $3
    I11:  lw      $f0, 0($2)
    I12:  bltz,c  $f0, $L4
    I13:  addu,c  $f0, $f0, 1
    I14:  sw      $f0, 0($2)
    $L4:
    I15:  addu    $16, $16, 1
    I15': addu,c  $f2, $f2, 1
    I16:  slt,c   $f4, $f2, 66
    I17:  bne,c   $f4, $0, $L5

Figure 4-6. Partitioning with code duplication. [The SDG of Figure 4-4 redrawn with
duplicate nodes 1' and 15'; legend: Loads, Store, LdSt inst, Comp inst, Duplicate inst.]

4.5.3 Algorithm for Introducing Copies and Duplicating Code
A simple heuristic is used to decide whether a given node v should be copied (to be more
precise, whether the result of node v should be copied) or duplicated. The heuristic uses
the number of parents of the node as input. The heuristic favors duplication of the node if
it has few parents or if the node has parents outside its enclosing loop. In our example
program, nodes 1 and 15 are candidates for copying/duplication. Because node 15 is within a
loop, both techniques introduce an overhead of one instruction per loop iteration.
Duplication of node 15 requires that node 1 be duplicated/copied. Because node 1 is outside
the loop, duplication is preferable.
The advanced partitioning algorithm starts by initializing the LdSt partition to be the
LdSt slice. Then the algorithm iteratively expands the LdSt partition to include
instructions that are not profitable for execution in the Comp subsystem. It does so by
analyzing the instructions on the boundary between the LdSt and Comp partitions for
execution in the Comp subsystem. The boundary is made up of LdSt nodes whose children are
not in LdSt. For each child of a boundary instruction, the algorithm essentially checks if
the benefit of executing the child instruction in the Comp subsystem is positive, taking
into account the extra copies and duplicate instructions that might be necessary. If not,
the boundary is expanded to include the instruction in the LdSt partition. The algorithm
stops when the boundary can no longer be grown. The advanced partitioning algorithm is
described in further detail by Sastry et al. [SPS98].
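As a worked illustration of the cost model of Section 4.5.2, the sketch below evaluates Benefit, Overhead, and Profit for the duplication of nodes 1 and 15 in the running example. It is not the compiler's implementation: the function name profit, the basic-block names, the unit overheads, and the execution counts (one entry into the loop, 66 iterations, every iteration reaching both blocks) are hypothetical values chosen only to make the arithmetic concrete.

```python
def profit(s_c, s_copy, s_dupl, block_of, exec_count, o_copy=1.0, o_dupl=1.0):
    """Profit = Benefit - Overhead, following the equations of Section 4.5.2.

    s_c:        nodes that move from LdSt to Comp
    s_copy:     nodes whose results are copied
    s_dupl:     nodes that are duplicated
    block_of:   node -> basic block B(v)
    exec_count: basic block -> dynamic execution count n_B
    """
    def n(v):
        return exec_count[block_of[v]]

    benefit = sum(n(v) for v in s_c)
    overhead = (o_copy * sum(n(v) for v in s_copy)
                + o_dupl * sum(n(v) for v in s_dupl))
    return benefit - overhead

# Hypothetical profile for the loop of Figure 4-1.
BLOCK_OF = {"1": "entry", "3": "head", "4": "head", "5": "head",
            "15": "latch", "16": "latch", "17": "latch"}
EXEC_COUNT = {"entry": 1, "head": 66, "latch": 66}

S_C = {"3", "4", "5", "16", "17"}   # the five instructions gained by Comp
S_DUPL = {"1", "15"}                # duplicated as I1' and I15'

example_profit = profit(S_C, set(), S_DUPL, BLOCK_OF, EXEC_COUNT)
```

Under these assumed counts the benefit is 5 x 66 = 330 dynamic instructions and the overhead is 1 + 66 = 67, so the profit is 263 and the duplication would be accepted. Moving a candidate out of S_C, or placing a duplicate in a much hotter block than the instructions it enables, is what drives the profit toward zero and causes the boundary to expand instead.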
4.6 Experimental Evaluation
4.6.1 Evaluation Methodology
We used gcc-2.7.1 as the base compiler for studying the partitioning schemes. The compiler
was modified by Subramanya Sastry to generate code for the extended SimpleScalar
[BAB96] ISA which is based on the MIPS ISA. The SimpleScalar instruction set was
extended by using new opcodes to encode integer instructions executing in the augmented
floating-point subsystem. For the conventional microarchitecture, the benchmark programs
are compiled by the base compiler (unmodified gcc-2.7.1).
Code partitioning is performed on the intermediate representation of the program. This
is done only after the initial machine-independent optimizations [ASU88] like
loop-invariant code motion, constant propagation, common subexpression elimination, etc.,
are complete. Register allocation is performed only after code partitioning is performed.
Operands of instructions in Comp are allocated floating-point registers.
A timing simulator based on the SimpleScalar tool set [BAB96] was used for performance
evaluations. The timing simulator models both a conventional and an integer-decoupled
microarchitecture. Both microarchitectures are identical except for execution of
integer operations in the floating-point subsystem. The simulator is cycle-based and the
machine parameters simulated for the 4-way and 8-way issue machines are detailed in
Table 4.2.
We used programs from the SPECint95 benchmark suite to conduct our evaluation. The
benchmarks and the inputs used are given in Table 4.3. The base optimization level used
for compiling the benchmarks is -O3 which enables common subexpression elimination,
loop invariant removal, and jump optimizations among others. All the benchmarks were
run to completion. Compress had the lowest instruction count at 410 million instructions
and perl had the highest at 1.2 billion instructions.
    Parameter                 4-way                     8-way
    Fetch width               any 4 instructions        any 8 instructions
    I-Cache                   32 KB, 2-way set associative, 64 byte lines, 1 cycle hit
                              time, 6 cycle miss penalty (both machines)
    Branch predictor          McFarling's gshare [McF93] with 1M 2-bit counters, 20-bit
                              global history, unconditional control flow instructions
                              predicted perfectly (both machines)
    Rename width              any 4 instructions        any 8 instructions
    Issue window size         16 int/16 fp              32 int/32 fp
    Max. in-flight insts      32                        64
    Retire width              4                         8
    Functional units          2 Int + 2 Fp units        4 Int + 4 Fp units
    Functional unit latency   6 cycle mul, 12 cycle div, 1 cycle for rest (both machines)
    Issue mechanism           up to 4 ops/cycle         up to 8 ops/cycle
                              out-of-order issue, loads may execute when prior store
                              addresses are known (both machines)
    Physical registers        48 int/48 fp              80 int/80 fp
    D-Cache                   32 KB, 2-way set-associative, write-back, write-allocate,
                              32 byte lines, 1 cycle hit time, 6 cycle miss penalty
                              (both machines)
                              one load/store port       two load/store ports

Table 4.2: Machine parameters.
4.6.2 Performance Results
In this subsection, results for the performance of the two partitioning schemes and the
net speedups possible with the integer-decoupled microarchitecture are presented. All our
results are based on the assumption that only the simple integer operations shown in
Table 4.1 are supported in the Comp subsystem. We then examine the impact on perfor-
mance of supporting some of the more complex integer operations in the Comp sub-
system.
Percentage of Computation Off-loaded to the Comp subsystem
The graph in Figure 4-7 shows the percentage of total dynamic instructions off-loaded
by the compiler for each of the benchmark programs. The graph shows the size of the
Comp partition for both the basic and the advanced partitioning schemes. Because all the
benchmark programs are integer programs that execute negligible floating-point
instructions, the bars in the graph correspond to the amount of integer computation that
the compiler is able to identify and off-load to the Comp subsystem. Overall, the compiler
is successful in off-loading a sizable fraction of the total computation to the Comp
subsystem. In the case of ijpeg, m88ksim, and gcc more than 20% of the total computation is
supported in the Comp subsystem. The graph also shows that the advanced partitioning
scheme generates bigger partitions than the basic scheme for all the benchmarks. For perl,
go, and compress, the partitions generated by the advanced partitioning scheme are almost
twice the size of those generated by the basic scheme. Ijpeg benefits the most from the
advanced scheme: the Comp computation increases from 10.7% to 32.1%. However, for li,
the advanced scheme does not perform better than the basic scheme because li is call
intensive and has a number of small functions.

    Benchmark   Input
    compress    test.in
    li          browse.lsp
    gcc         stmt.i
    m88ksim     ctl.raw, dhry.big
    go          2stone9.in
    ijpeg       vigo.ppm
    perl        srabbl.ppl

Table 4.3: Benchmark programs.
While the advanced partitioning scheme might be able to off-load more computation, the
percentages must be judged in conjunction with the change in the instruction cache perfor-
mance and the total number of instructions executed due to the extra instructions intro-
duced. Hence, we studied the overhead introduced by the advanced partitioning scheme.
For all the benchmarks, we found the change in static code size to be negligible. As a
result there was very little change in I-cache hit rates for all the benchmarks. Only in the
case of perl was there a noticeable increase in I-cache hit rate by 1.8%. The increase in the
number of dynamic instructions executed is also small. The maximum increase is 2% for
compress. Copies account for 0.6% and the rest, 1.4%, is due to duplicates. For gcc, there
is a 1.2% increase in instruction count, half of which resulted from an increase in loads
and stores. Copies and duplicates accounted for the rest. Overall, these results show that
the advanced partitioning scheme is successful in increasing the Comp partition sizes
without introducing a lot of overhead.
Figure 4-7. Percentage of instructions assigned to Comp. [Bar chart; y-axis: Instructions
in Comp (%), 0 to 45; x-axis: compress, gcc, go, ijpeg, li, m88ksim, perl; bars: Basic
scheme, Advanced scheme.]
Performance Improvements
The graph in Figure 4-8 shows the performance improvements obtained by the integer-
decoupled microarchitecture over a conventional microarchitecture for the 4-way issue (2
int + 2 fp) machine. Improvements due to both the basic and the advanced partitioning
schemes are presented. For m88ksim, compress, and ijpeg, performance improvements
over 10% are achieved with the advanced partitioning scheme. In the case of m88ksim, an
impressive improvement of 23% is achieved with the advanced partitioning scheme. Over-
all for the 4-way machine, the integer-decoupled microarchitecture coupled with the
advanced partitioning scheme is capable of providing modest to impressive speedups over
the conventional microarchitecture.
As expected, performance improvements increase as more instructions are off-loaded to
the Comp subsystem. However, the improvements do not directly reflect the size of the
Comp partitions, i.e. a bigger Comp partition does not necessarily result in a greater
performance improvement, for two reasons. First, the load imbalance between the LdSt and
the Comp partitions results in lower speedups than expected. For example, the Comp
partition of ijpeg with advanced partitioning is bigger than that of m88ksim with basic
partitioning, but the corresponding improvement of ijpeg is much smaller than that of
m88ksim. We found load imbalance to be the culprit in this case. There are phases in which
the majority of the computation is supported in the Comp subsystem leaving the LdSt
subsystem relatively idle. Quantitatively, simulations of ijpeg show that the LdSt subsystem
is idle 13.5% of the cycles when the Comp subsystem is executing one or more instructions.
The equivalent number for m88ksim is only 4.4%. With the advanced partitioning scheme,
m88ksim also suffers from the problem of load imbalance. For m88ksim with the advanced
scheme, the LdSt subsystem is idle 12.4% of the cycles in which the Comp subsystem is
executing one or more instructions. This partly explains why performance only improves by
about 2.6% even though the size of the partition increases by 12%.

Figure 4-8. Speedups on the 4-way machine. [Bar chart; y-axis: Performance improvement
(%), 0 to 25; x-axis: compress, gcc, go, ijpeg, li, m88ksim, perl; bars: Basic scheme,
Advanced scheme.]
Another reason performance might not improve with Comp partition size is that in some
cases the critical path of execution is not affected by partitioning. For example, with the
basic partitioning scheme, 15% of the code in mpegplay executes in the Comp subsystem,
but the resulting speedup is only 2.7%. Loads and stores contribute close to 47% of the
total instructions in the benchmarks, and hence performance is largely determined by the
cache bandwidth available. Since the integer-decoupled microarchitecture has the same
cache bandwidth as the conventional microarchitecture, the performance of mpegplay
does not improve significantly. Even with the advanced partitioning scheme and a bigger
Comp partition, the speedup is only 4%.
The graph also shows that for most benchmarks, the advanced partitioning scheme
yields better speedups than the basic partitioning scheme. The two exceptions are li and
m88ksim. In the case of li, the increase in the size of the Comp partition is very small. For
m88ksim, load imbalance seems to be the problem as mentioned earlier.

Figure 4-9. Speedups on the 8-way machine. [Bar chart; y-axis: Performance improvement
(%), 0 to 25; x-axis: compress, gcc, go, ijpeg, li, m88ksim, perl; bars: Basic scheme,
Advanced scheme.]
Performance Improvements on the 8-way machine
The graph in Figure 4-9 shows performance improvements on the 8-way issue (4 int + 4
fp) machine. The speedups on the 8-way issue machine are smaller than the speedups
achieved on the 4-way issue machine. This is expected because the number of units in the
LdSt subsystem now gets within the range of average parallelism in the programs. So, the
extra issue bandwidth available in the Comp subsystem is not exploited as much. How-
ever, m88ksim achieves an improvement of 19% because it has enough parallelism and is
able to exploit the presence of a bigger instruction window and the wider issue and execu-
tion bandwidth.
Instruction mix of the Comp partition
The instruction mix of the Comp partition, assuming that integer multiply and divide
operations are also available in the Comp subsystem, is shown in Figure 4-10. The graph
shows that, except for ijpeg, all the benchmarks execute a negligible number of integer
multiply and divide operations in the Comp subsystem. Ijpeg has the maximum percentage
of multiplies at 2.77%. Ijpeg also has the maximum percentage of divides at 0.11%. For the
remaining benchmarks, the instruction mix is almost entirely composed of simple control,
logical, and arithmetic instructions. This observation matches the results of other
studies [HP96].
For ijpeg, we studied the performance effects of supporting integer multiply and divide
operations in the Comp subsystem. This has a dramatic effect on the basic partitioning
scheme. The Comp percentage increased from 11% to 40%. The speedups also increased
from 6% to 16% because in some frequently executed functions of ijpeg, the multiply
instructions are closely related to the rest of the instructions in the function. So, when the
multiply instructions are moved to the LdSt subsystem, all reachable instructions are also
moved to the LdSt subsystem which effectively moves the whole function to LdSt. How-
ever, the change was not as marked with the advanced partitioning scheme because it was
able to recoup some of the computation that got moved to LdSt using copies. The Comp
partition size increased from 11% to 32%. The performance improvement on the 4-way
issue machine increased from 6% to 11%. This shows that the advanced partitioning
scheme is successful in reducing the impact of the absence of integer multiply and divide
instructions in the Comp subsystem.
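The effect described above, where one unsupported operation drags a whole function into LdSt, can be illustrated with a small sketch. It assumes that "reachable" means connected through register dependences in either direction; the function name `basic_partition`, the opcode strings, and the example dependence graph are hypothetical, and the sketch does not model the copies used by the advanced scheme.

```python
from collections import defaultdict, deque


def basic_partition(instrs, deps, comp_ops):
    """Assign each instruction to "Comp" or "LdSt".

    instrs:   dict mapping instruction name -> opcode (hypothetical encoding)
    deps:     iterable of (producer, consumer) register dependences
    comp_ops: set of opcodes the Comp subsystem is assumed to support
    """
    # Treat dependences as undirected so that no value crosses partitions.
    neighbors = defaultdict(set)
    for producer, consumer in deps:
        neighbors[producer].add(consumer)
        neighbors[consumer].add(producer)

    # Seed LdSt with every instruction that Comp cannot execute.
    ldst = {name for name, op in instrs.items() if op not in comp_ops}
    queue = deque(ldst)
    while queue:
        name = queue.popleft()
        for other in neighbors[name]:
            if other not in ldst:
                ldst.add(other)
                queue.append(other)

    return {name: ("LdSt" if name in ldst else "Comp") for name in instrs}


# Hypothetical function body: i2 is a multiply tied to i1 and i3.
instrs = {"i1": "add", "i2": "mul", "i3": "sub", "i4": "ld", "i5": "and", "i6": "or"}
deps = [("i1", "i2"), ("i2", "i3"), ("i5", "i6")]

without_mul = basic_partition(instrs, deps, {"add", "sub", "and", "or"})
with_mul = basic_partition(instrs, deps, {"add", "sub", "and", "or", "mul"})
```

In this toy graph, leaving multiplies out of `comp_ops` pulls i1 and i3 into LdSt along with i2, while adding `"mul"` keeps all three in Comp.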
4.7 Related Work
The early Control Data Corporation and Cray Research style of architectures [Rus78,
Tho61] were the first to distinguish operand access and computation. One set of functional
units and registers is used for addressing and a second set is used for computation in these
architectures. Smith [Smi82] proposed the decoupled style of machine organization in
which operand access and computation are separated and executed in parallel. The access
subsystem executes memory access related instructions while the execute subsystem sup-
ports compute instructions. The access and execute subsystems communicate through
queues. This organization style permits the access subsystem to slip ahead of the execute
subsystem and hence, helps hide the latency of memory access. Experimental evaluation
showed considerable speedups for the floating-point programs studied. Work along similar
lines is reported by Pleszkun and Davidson [PD83], Goodman et al. [GHL+85], and Bird
et al. [BRT93].
[Figure 4-10. Instruction mix of the Comp partition. Chart of the percent of Comp
instructions (%) that are control, logical, arithmetic, integer multiply, and integer divide
operations for compress, gcc, go, ijpeg, m88ksim, li, and perl.]
The decoupling concept has since been successfully implemented in a number of com-
mercial machines like the IBM RS/6000 [Gro90] and the MIPS R8000 [Hsu94]. However,
both these implementations only decouple integer and floating-point subsystems. While
this helps to decouple memory access and computation in floating-point programs, integer
programs cannot benefit from decoupling in these implementations.
The work presented in this chapter extends earlier work in the area of decoupled archi-
tectures in two important ways. First, the proposed integer-decoupled microarchitecture
applies the concept of decoupling to integer programs. Second, decoupling is used as a
technique to extract additional performance for integer codes from conventional microar-
chitectures without increasing their complexity.
In the context of the compiler work presented, the most closely related work is reported
by Capitanio et al. [CDN92]. They study code partitioning for a VLIW architecture with
partitioned register files. Their architecture consists of a number of homogeneous clusters,
each of which is statically scheduled. In contrast, the integer-decoupled microarchitecture
is heterogeneous; only the LdSt subsystem can execute loads and stores. Further, the
earlier study applied code partitioning only to straight-line loop bodies and did not con-
sider code duplication as a means of avoiding inter-partition communication.
4.8 Chapter Summary
Conventional microarchitectures suffer from idle floating-point resources when executing
integer codes. This chapter proposed integer-decoupled microarchitectures that
address this drawback by supporting some of the non-addressing computation in integer
programs in an augmented floating-point subsystem. For integer programs, this provides
extra issue and execution bandwidth as well as a larger window for dynamic
scheduling without increasing the complexity of the conventional microarchitecture.
Furthermore, the only change required to the hardware is the implementation of simple
integer operations in the floating-point subsystem.
The performance of the proposed microarchitecture was evaluated relative to a conven-
tional microarchitecture. The results show two things. First, for the benchmarks studied,
the compiler is able to off-load a significant fraction, from 9% to 41%, of the total
computation in integer programs to the augmented floating-point subsystem. Second, as a
result, performance improvements in the 3% to 23% range were achieved on a 4-way issue
processor.
Hence, I believe that the integer-decoupled microarchitecture is an attractive choice for
future processors especially considering that the hardware changes required to adapt the
conventional microarchitecture are small.
Chapter 5
Conclusions
This thesis examined the trade-off between hardware complexity and clock speed in the
design of superscalar microarchitectures. Using the results of the trade-off analysis, the
thesis proposed and evaluated two new superscalar microarchitectures designed with the
goal of achieving high performance by reducing complexity.
5.1 Thesis Summary
Superscalar microarchitectures provide high performance by using hardware techniques
to execute multiple instructions every cycle. The performance of these microarchitectures
is directly proportional to the product
$$\text{Instructions Per Cycle} \times \text{Clock Frequency}.$$
Instructions Per Cycle, or IPC, measures the amount of parallelism extracted by the
microarchitecture and Clock Frequency is the speed at which the microarchitecture can be
clocked. Complex hardware helps improve the IPC factor by extracting higher levels of
instruction-level parallelism. However, the complex hardware employed to achieve high
IPC can potentially slow the clock and hence, nullify the improvements in IPC. Therefore,
there is a need for developing microarchitectures that judiciously use hardware complexity
for extracting higher levels of parallelism while permitting a fast clock; that is, to develop
microarchitectures we refer to as complexity-effective microarchitectures.
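The trade-off between the two factors can be made concrete with a short sketch. The design points below are hypothetical numbers chosen only for illustration, not results from this thesis, and the names `performance` and `speedup` are likewise assumptions.

```python
def performance(ipc, clock_ghz):
    """Relative performance as IPC x clock frequency (billions of instructions per second)."""
    return ipc * clock_ghz


def speedup(base, candidate):
    """Ratio of candidate performance to base performance; each argument is (ipc, clock_ghz)."""
    return performance(*candidate) / performance(*base)


# Hypothetical design points: (IPC, clock in GHz).
simple_design = (1.8, 1.00)
complex_design = (2.0, 0.85)   # higher IPC, but the added complexity slows the clock

ratio = speedup(simple_design, complex_design)
```

With these assumed numbers the complex design gains about 11% in IPC but loses 15% in clock frequency, so `ratio` falls below 1: the IPC improvement is more than nullified.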
To design microarchitectures that are complexity-effective, computer architects need
simple models for measuring complexity that can be used at a fairly early stage of the
design process. In addition to determining complexity-effectiveness, such models help
identify long-term complexity trends.
The first part of this thesis presented simple models that quantify the complexity of
superscalar microarchitectures. A baseline superscalar pipeline is presented and structures
whose complexity grows with increasing ILP are identified. Of these structures, register
renaming, instruction window wakeup, instruction window selection, register file access,
and operand bypassing are analyzed in detail. Each is modeled and Spice simulated for
three different feature sizes representing past, present, and future technologies. Simple
analytical models are developed that express the delay of each of the structures in terms of
microarchitectural parameters like issue width and instruction window size. The impact of
technology trends is also studied. In particular, the impact of poor scaling of wire delays in
future technologies is analyzed.
Results show that the logic associated with managing the issue window of a superscalar
processor is likely to become the most critical structure as we move towards wider-issue,
larger windows, and advanced technologies in which wire delays dominate. One of the
functions implemented by the logic is the broadcast of result tags over wires that span the
instruction window. This operation does not scale well, especially as feature sizes are
reduced. Furthermore, in order to be able to execute dependent instructions in consecutive
cycles (a desirable feature from the point of view of performance), the delay of the
window logic should be less than a cycle.
In addition to window logic, a second structure that needs careful consideration in future
technologies is the data bypass logic. The length of result wires used to broadcast bypass
values increases linearly with issue width and hence, the delay of the data bypass logic
increases at least linearly with issue width. As a result, the data bypass delay can grow
significantly for wider microarchitectures in future technologies and force architects to
consider clustered microarchitectures.
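A rough sketch of why wire length matters follows. It assumes a distributed-RC estimate of 0.5 x R x C x L^2 for wire delay, takes the per-micron metal resistance and capacitance from Table A.2, and uses a hypothetical 500 µm of result wire per issue slot; that length and the function names are illustrative assumptions, not values from this thesis.

```python
# Metal resistance (ohm/um) and capacitance (fF/um) per technology, from Table A.2.
METAL_RC = {
    "0.8um": (0.02, 0.275),
    "0.35um": (0.046, 0.628),
    "0.18um": (0.09, 1.22),
}

# Hypothetical wire length contributed by each issue slot, in um (an assumption).
UM_PER_ISSUE_SLOT = 500.0


def wire_delay_ps(technology, length_um):
    """Distributed-RC estimate 0.5 * R * C * L^2, returned in picoseconds."""
    r_per_um, c_ff_per_um = METAL_RC[technology]
    # ohm * fF = 1e-15 s, which is 1e-3 ps.
    return 0.5 * r_per_um * c_ff_per_um * length_um ** 2 * 1e-3


def bypass_wire_delay_ps(technology, issue_width):
    """Assume the result-wire length grows linearly with issue width."""
    return wire_delay_ps(technology, issue_width * UM_PER_ISSUE_SLOT)


delay_4way = bypass_wire_delay_ps("0.18um", 4)
delay_8way = bypass_wire_delay_ps("0.18um", 8)
```

Under this model, doubling the issue width doubles the wire length and quadruples the wire delay, which is consistent with the statement that the delay grows at least linearly with issue width.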
To address the complexity of window logic and data bypass logic, a family of complex-
ity-effective microarchitectures called the dependence-based superscalar microarchitec-
tures is proposed and studied. The proposed microarchitectures achieve the dual goals of
high IPC and a fast clock using two main techniques. First, the machine is partitioned into
multiple clusters, each of which contains a slice of the instruction window and execution
resources of the whole processor. This enables high-speed clocking of the individual
clusters since the narrow issue width and the small instruction window in each cluster keep
critical delays small. The second technique involves intelligent steering of instructions to
the multiple clusters so that the whole width of the machine is utilized while minimizing
the performance degradation due to slow inter-cluster communication. Experimental
results show that dependence-based superscalar microarchitectures are capable of extract-
ing similar levels of parallelism as conventional microarchitectures while facilitating a
faster clock.
The third contribution of this thesis is the integer-decoupled microarchitecture. The inte-
ger-decoupled microarchitecture improves the performance of integer programs and can
be integrated into a conventional microarchitecture with little or no increase in complexity.
Floating-point units in the conventional microarchitecture are augmented to perform
simple integer operations and the resulting floating-point subsystem is used to support some
of the computation in integer programs. The computation to be off-loaded is identified by
the compiler. Simulation results are presented that show modest speedups for a 4-way pro-
cessor. The speedups diminish with increasing issue width.
5.2 Future Directions
5.2.1 Quantifying the Complexity of Superscalar Microarchitectures
Analysis similar to that presented in this thesis can be applied to other structures in the
pipeline that are not studied here. Two specific examples are the instruction fetch logic and
the load/store queue logic. The complexity of the latter in particular has been problematic
[Yea97] for designers in industry.
5.2.2 Dependence-based Superscalar Microarchitectures
The instruction steering heuristics studied in this thesis are simple in that they do not
require more than one extra pipe stage. One avenue for future research is the feasibility
and applicability of caching steering information. Caching steering information can help
move the steering logic out of the critical path. This would open up the possibility of more
complex steering heuristics. Therefore, it might be worthwhile to study sophisticated
steering heuristics that can further boost the parallelism extracted by the dependence-
based microarchitectures.
The fifo steering heuristic studied in this thesis steers instructions solely based on
register dependences between instructions. It might be possible to augment the heuristic with
the memory-dependence prediction techniques proposed by Moshovos et al. [MBVS97] to
help create longer chains. For example, a load instruction can be steered to the fifo that
contains an earlier store instruction to the same address as the one referenced by the load.
Note that at the time of steering, the addresses referenced by the load and the store
instruction are not known. Memory-dependence prediction can be used to chain dependent
load-store pairs and steer them to the same fifo.
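The idea can be sketched as follows. This is a simplified illustration rather than the heuristic evaluated in this thesis: it only checks the tail of each fifo, never drains fifos, and represents the memory-dependence predictor by a hypothetical `predicted_store` field naming the store a load is predicted to depend on.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Instr:
    name: str
    dest: Optional[str] = None                 # destination register, if any
    srcs: List[str] = field(default_factory=list)
    predicted_store: Optional[str] = None      # hypothetical predictor output


def steer(instr, fifos):
    """Append instr to a fifo and return its index, or None if steering must stall."""
    # Prefer a fifo whose tail produces a source register of instr, or whose
    # tail is the store this load is predicted to depend on.
    for index, fifo in enumerate(fifos):
        if not fifo:
            continue
        tail = fifo[-1]
        reg_dep = tail.dest is not None and tail.dest in instr.srcs
        mem_dep = instr.predicted_store is not None and tail.name == instr.predicted_store
        if reg_dep or mem_dep:
            fifo.append(instr)
            return index
    # Otherwise start a new chain in an empty fifo.
    for index, fifo in enumerate(fifos):
        if not fifo:
            fifo.append(instr)
            return index
    return None


fifos = [[], []]
chain = [
    steer(Instr("add", dest="r1", srcs=["r2", "r3"]), fifos),
    steer(Instr("sub", dest="r4", srcs=["r1"]), fifos),
    steer(Instr("st1", srcs=["r5", "r6"]), fifos),
    steer(Instr("ld1", dest="r7", srcs=["r8"], predicted_store="st1"), fifos),
]
```

In this example the load has no register dependence on the store, so register-based steering alone would not chain them; the predicted memory dependence places `ld1` behind `st1` in the same fifo.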
5.2.3 Integer-decoupled Microarchitecture
There is always scope for more research in developing improved partitioning heuristics
that can off-load more computation to the augmented FP subsystem. Another possibility is
to study heuristics that not only try to off-load a sizable fraction of the total computation,
but also try to balance the load on the two subsystems.
An alternative scheme for utilizing the idle floating-point subsystem in a conventional
microarchitecture is to use the idle subsystem to execute along both paths of likely
mispredicted branches [HS96] in integer programs. Of course, this would require extra
hardware support.
References
[ABHS89] M. C. August, G. M. Brost, C. C. Hsiung, and A. J. Schiffleger. Cray X-MP:
The Birth of a Supercomputer. IEEE Computer, 22:45-54, January 1989.
[ACR95] P. S. Ahuja, D. W. Clark, and A. Rogers. The Performance Impact of Incomplete
Bypassing in Processor Pipelines. In Proceedings of the 28th Annual International
Symposium on Microarchitecture, November 1995.
[AMG+95] C. Asato, R. Montoye, J. Gmuender, E. W. Simmons, A. Ike, and J. Zasio. A
14-Port 3.8ns 116-Word 64b Read-Renaming Register File. In 1995 IEEE International
Solid-State Circuits Conference Digest of Technical Papers, pages 104-105,
February 1995.
[AS92] T. M. Austin and G. S. Sohi. Dynamic Dependency Analysis of Ordinary Programs.
In Proceedings of the 19th Annual International Symposium on Computer
Architecture, pages 342-351, May 1992.
[Ass97] Semiconductor Industry Association. The National Technology Roadmap for
Semiconductors, 1997.
[AST67] D. W. Anderson, F. J. Sparacio, and R. M. Tomasulo. The IBM System/360
Model 91: Machine Philosophy and Instruction-Handling. IBM Journal of Research
and Development, 11:8-24, January 1967.
[ASU88] A. V. Aho, R. Sethi, and J. D. Ullman. Compilers: Principles, Techniques and
Tools. Addison Wesley, 1988.
[BAB96] D. Burger, T. M. Austin, and S. Bennett. Evaluating Future Microprocessors: the
SimpleScalar Tool Set. Technical Report CS-TR-96-1308 (Available from
http://www.cs.wisc.edu/trs.html), University of Wisconsin-Madison, July 1996.
[Boh95] M. T. Bohr. Interconnect Scaling - The Real Limiter to High Performance ULSI.
In 1995 International Electron Devices Meeting Technical Digest, pages 241-244,
1995.
[BP92] M. Butler and Y. N. Patt. An Investigation of the Performance of Various Dynamic
Scheduling Techniques. In Proceedings of the 25th Annual International Symposium
on Microarchitecture, pages 1-9, December 1992.
[Bre] S. E. Breach. Design and Evaluation of a Multiscalar Processor. Ph.D. thesis in
preparation at University of Wisconsin-Madison.
[BRT93] P. L. Bird, A. Rawsthorne, and N. P. Topham. The Effectiveness of Decoupling.
In Proceedings of the 7th ACM International Conference on Supercomputing, pages
47-56, 1993.
[Buc62] W. Buchholz. Planning a Computer System: Project Stretch. McGraw-Hill, 1962.
[CDN92] A. Capitanio, N. Dutt, and A. Nicolau. Partitioned Register Files for VLIWs: A
Preliminary Analysis of Tradeoffs. In Proceedings of the 25th Annual International
Symposium on Microarchitecture, pages 292-300, December 1992.
[Cha81] A. E. Charlesworth. An Approach to Scientific Array Processing: The Architectural
Design of the AP-120B/FPS-164 Family. IEEE Computer, 14(9):18-27, 1981.
[Cha91] T. Chappell. A 2ns Cycle, 4 ns Access 512kb CMOS ECL SRAM. In 1991 IEEE
International Solid-State Circuits Conference Digest of Technical Papers, pages 50-51,
February 1991.
[Cha95] J. I. Chamdani. Microarchitecture Techniques to Improve the Design of Supersca-
lar Microprocessors. PhD thesis, Georgia Institute of Technology, March 1995.
[CLL92] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms.
McGraw Hill, 1992.
[CNO+88] R. P. Colwell, R. P. Nix, J. J. O'Donnell, D. B. Papworth, and P. K. Rodman. A
VLIW Architecture for a Trace Scheduling Compiler. IEEE Transactions on Computers,
37:967-979, August 1988.
[DF90] P. K. Dubey and M. J. Flynn. Optimal Pipelining. Journal of Parallel and Distributed
Computing, 8:10-19, 1990.
[Dig96] Digital Equipment Corporation. Alpha Architecture Handbook, Version 3, October
1996.
[DM74] J. B. Dennis and D. P. Misunas. A Preliminary Architecture for a Basic Dataflow
Computer. In Proceedings of the 2nd Annual International Symposium on Computer
Architecture, pages 126-132, 1974.
[D+74] R. Dennard et al. Design of Ion-implanted MOSFETs With Very Small Physical
Dimensions. IEEE Journal of Solid-State Circuits, SC-9:256-268, 1974.
[D+92] D. Dobberpuhl et al. A 200-MHz 64-bit Dual-issue Microprocessor. IEEE Journal
of Solid-State Circuits, 27(11), November 1992.
[DT92] H. Dwyer and H. C. Torng. An Out-of-Order Superscalar Processor with Speculative
Execution and Fast, Precise Interrupts. In Proceedings of the 25th Annual International
Symposium on Microarchitecture, pages 272-281, December 1992.
[Ell85] J. R. Ellis. Bulldog: A Compiler for VLIW Architectures. The MIT Press,
Cambridge, Massachusetts, 1985.
[FCJV97] K. I. Farkas, P. Chow, N. P. Jouppi, and Z. Vranesic. The Multicluster Architecture:
Reducing Cycle Time Through Partitioning. In Proceedings of the 30th Annual
International Symposium on Microarchitecture, pages 149-159, December 1997.
[Fis81] J. A. Fisher. Trace Scheduling: A Technique For Global Microcode Compaction.
IEEE Transactions on Computers, C-30(7), July 1981.
[FJC96] K. I. Farkas, N. P. Jouppi, and P. Chow. Register File Design Considerations in
Dynamically Scheduled Processors. In Proceedings of the Second IEEE Symposium
on High-Performance Computer Architecture, February 1996.
[FJD80] S. Fuller, A. Jones, and I. Durham. CMU Cm* Review. Technical Report
AD-A050135, Department of Computer Science, Carnegie-Mellon University, 1980.
[Fra93] M. Franklin. The Multiscalar Architecture. PhD thesis, University of
Wisconsin-Madison, November 1993.
[FS92] M. Franklin and G. S. Sohi. The Expandable Split Window Paradigm for Exploiting
Fine-Grain Parallelism. In Proceedings of the 19th Annual International Symposium
on Computer Architecture, pages 58-69, May 1992.
[GHL+85] J. R. Goodman, J. T. Hsieh, K. Liou, A. R. Pleszkun, P. B. Schechter, and H. C.
Young. PIPE: A VLSI Decoupled Architecture. In Proceedings of the 12th Annual
International Symposium on Computer Architecture, pages 20-27, 1985.
[Gro90] G. F. Grohoski. Machine Organization of the IBM RISC System/6000 Processor.
IBM Journal of Research and Development, 34(1):37-58, January 1990.
[G+97] B. A. Gieseke et al. A 600MHz Superscalar RISC Microprocessor With Out-Of-Order
Execution. In 1997 IEEE International Solid-State Circuits Conference Digest
of Technical Papers, pages 176-177, February 1997.
[Gwe93] L. Gwennap. Speed Kills? Not for RISC Processors. Microprocessor Report,
7(3):3, March 1993.
[Gwe95a] L. Gwennap. Hal Reveals Multichip SPARC Processor. Microprocessor Report,
9(3), March 1995.
[Gwe95b] L. Gwennap. Intel's P6 Uses Decoupled Superscalar Design. Microprocessor
Report, 9(2), February 1995.
[Gwe95c] L. Gwennap. UltraSparc Adds Multimedia Instructions. Microprocessor Report,
8(16), December 1995.
[Gwe96a] L. Gwennap. Digital 21264 Sets New Standard. Microprocessor Report,
10(14):11-16, October 1996.
[Gwe96b] L. Gwennap. Intel's MMX Speeds Multimedia. Microprocessor Report, 10(3),
March 1996.
[HF88] I. S. Hwang and A. L. Fisher. A 3.1ns 32b CMOS Adder in Multiple Output Domino
Logic. In 1988 IEEE International Solid-State Circuits Conference Digest of
Technical Papers, pages 140-141, February 1988.
[Hin95] G. Hinton. Pentium Pro Processor, December 1995. Tutorial talk at 28th Annual
International Symposium on Microarchitecture.
[HP86] W. W. Hwu and Y. N. Patt. HPSm, A High Performance Restricted Data Flow
Architecture Having Minimal Functionality. In Proceedings of the 13th Annual
International Symposium on Computer Architecture, pages 297-307, June 1986.
[HP96] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative
Approach. Morgan Kaufmann, second edition, 1996.
[HPS92] M. Horowitz, S. Przybylski, and M. D. Smith. Recent Trends in Processor
Design: Reclimbing the Complexity Curve, August 1992. Tutorial talk at Western
Institute of Computer Science, Stanford University.
[HS96] T. H. Heil and J. E. Smith. Selective Dual Path Execution. Unpublished, University
of Wisconsin-Madison, November 1996.
[Hsu94] P. Y. T. Hsu. Design of the R8000 Microprocessor. IEEE Micro, pages 23-33,
April 1994.
[I+95] Inoue et al. A 0.4um 1.4ns 32b Dynamic Adder Using Non-precharge Multiplexers
and Reduced Precharge Voltage Techniques. In 1995 Symposium on VLSI Circuits
Digest of Technical Papers, pages 9-10, June 1995.
[JJ90] M. G. Johnson and N. P. Jouppi. Transistor model for a Synthetic 0.8um CMOS pro-
cess, May 1990. Class notes for Stanford University EE371.
[Joh91] M. Johnson. Superscalar Microprocessor Design. Prentice-Hall, 1991.
[Jol91] R. D. Jolly. A 9-ns, 1.4-Gigabyte/s, 17-Ported CMOS Register File. IEEE Journal
of Solid-State Circuits, 26(10), October 1991.
[JW89] N. P. Jouppi and D. W. Wall. Available Instruction-Level Parallelism for Supersca-
lar and Superpipelined Machines. In Proceedings of the Third International Confer-
ence on Architectural Support for Programming Languages and Operating Systems,
April 1989.
[Kel96] J. Keller. The 21264: A Superscalar Alpha Processor with Out-of-Order Execu-
tion, October 1996. 9th Annual Microprocessor Forum, San Jose, California.
[KF96] G. A. Kemp and M. Franklin. PEWS: A Decentralized Dynamic Scheduler for ILP
Processing. In Proceedings of the International Conference on Parallel Processing,
volume I, pages 239-246, 1996.
[KH92] G. Kane and J. Heinrich. MIPS RISC Architecture. Prentice Hall, 1992.
[KM89] L. Kohn and N. Margulis. Introducing the Intel i860 64-bit Microprocessor. In
IEEE Micro, pages 15-30, August 1989.
[Kog81] P. M. Kogge. The Architecture of Pipelined Computers. McGraw-Hill, 1981.
[KS86] S. R. Kunkel and J. E. Smith. Optimal Pipelining in Supercomputers. In Proceed-
ings of the 13th Annual International Symposium on Computer Architecture, June
1986.
[K+93] P. Knebel et al. HP's PA7100LC: A Low-Cost Superscalar PA-RISC Processor. In
Proceedings of COMPCON, pages 441-448, 1993.
[Kum96] A. Kumar. The HP-PA8000 RISC CPU: A High Performance Out-of-Order Processor.
In Proceedings of the Hot Chips VIII, pages 9-20, August 1996.
[LW92] M. S. Lam and R. P. Wilson. Limits of Control Flow on Parallelism. In Proceedings
of the 19th Annual International Symposium on Computer Architecture, pages
46-57, May 1992.
[Mat97] D. Matzke. Will Physical Scalability Sabotage Performance Gains. IEEE Computer,
30(9):37-39, 1997.
[MBVS97] A. Moshovos, S. E. Breach, T. N. Vijaykumar, and G. S. Sohi. Dynamic Spec-
ulation and Synchronization of Data Dependences. In Proceedings of the 24th
Annual International Symposium on Computer Architecture, pages 181-193, June
1997.
[McF93] S. McFarling. Combining Branch Predictors. Technical Report DEC WRL Tech-
nical Note TN-36, DEC Western Research Laboratory, 1993.
[Met87] Meta-Software Inc. HSpice User's Manual, June 1987.
[MF95] G. McFarland and M. Flynn. Limits of Scaling MOSFETs. Technical Report CSL-
TR-95-662 (Revised), Stanford University, November 1995.
[NH97] K. Nowka and H. P. Hofstee. Circuits and Microarchitecture for Gigahertz VLSI
Designs. In Proceedings of the 17th Conference on Advanced Research in VLSI,
pages 284-287, September 1997.
[Now95] K. Nowka. High-Performance CMOS System Design Using Wave Pipelining.
PhD thesis, Stanford University, August 1995.
[PD83] A. R. Pleszkun and E. S. Davidson. Structured Memory Access Architecture. In
Proceedings of the International Conference on Parallel Processing, pages 461-471, 1983.
[PS81] D. A. Patterson and C. H. Sequin. RISC I: A reduced instruction set VLSI com-
puter. In Proceedings of the 8th Annual International Symposium on Computer
Architecture, May 1981.
[Rab96] J. M. Rabaey. Digital Integrated Circuits - A Design Perspective. Prentice Hall
Electronics and VLSI Series, 1996.
[RBS96] E. Rotenberg, S. Bennett, and J. E. Smith. Trace Cache: a Low Latency Approach
to High Bandwidth Instruction Fetching. In Proceedings of the 29th Annual International
Symposium on Microarchitecture, December 1996.
[RF72] E. M. Riseman and C. C. Foster. The inhibition of potential parallelism by conditional
jumps. IEEE Transactions on Computers, C-21(12):1405-1411, December
1972.
[RJSS97] E. Rotenberg, Q. Jacobson, Y. Sazeides, and J. E. Smith. Trace Processors. In
Proceedings of the 30th Annual International Symposium on Microarchitecture,
pages 138-148, December 1997.
[RNOM95] K. Rahmat, O. S. Nakagawa, S-Y. Oh, and J. Moll. A Scaling Scheme for
Interconnect in Deep-Submicron Processes. Technical Report HPL-95-77, Hewlett-
Packard Laboratories, July 1995.
[Rus78] R. M. Russell. The CRAY-1 Computer System. Communications of the ACM,
21(1):63-72, January 1978.
[RYYT89] B. R. Rau, D. W. L. Yen, W. Yen, and R. Towle. The Cydra 5 Departmental
Supercomputer: Design Philosophies, Decisions, and Trade-offs. IEEE Computer,
22:12-35, January 1989.
[SBV95] G. S. Sohi, S. E. Breach, and T. N. Vijaykumar. Multiscalar Processors. In Pro-
ceedings of the 22nd Annual International Symposium on Computer Architecture,
pages 414-425, June 1995.
[Sch71] H. Schorr. Design Principles for a High-Performance System. In Symposium on
Computers and Automata, Polytechnic Institute of Brooklyn, pages 165-192, 1971.
[SDC95] S. P. Song, M. Denman, and J. Chang. The PowerPC 604 RISC Microprocessor.
In IEEE Micro, pages 8-17, October 1995.
[SF91] G. S. Sohi and M. Franklin. High-Bandwidth Data Memory Systems for Superscalar
Processors. In Proceedings of the Fourth International Conference on Architectural
Support for Programming Languages and Operating Systems, pages 53-62,
April 1991.
[Smi82] J. E. Smith. Decoupled Access/Execute Computer Architecture. In Proceedings of
the 9th Annual International Symposium on Computer Architecture, pages 112-119,
April 1982.
[Smi95] J. E. Smith. New Paradigms for Instruction Level Parallelism, October 1995. Talk
prepared at University of Wisconsin-Madison.
[Smi97] J. E. Smith. Amdahl's Law: Not Just an Equation, June 1997. Keynote speech at
the 24th Annual International Symposium on Computer Architecture.
[Soh90] G. S. Sohi. Instruction Issue Logic for High-Performance, Interruptible, Multiple
Functional Units, Pipelined Computers. IEEE Transactions on Computers, 39:349-359,
March 1990.
[SP88] J. E. Smith and A. R. Pleszkun. Implementing Precise Interrupts in Pipelined Processors.
IEEE Transactions on Computers, 37:562-573, May 1988.
[SPS98] S. Sastry, S. Palacharla, and J. E. Smith. Exploiting Idle Floating-Point Resources
for Integer Execution. In Proceedings of the SIGPLAN Conference on Programming
Language Design and Implementation, June 1998.
[SS90] J. E. Smith and G. S. Sohi. Studies in Program Characteristics and Architectural
Choices for High-Performance, Fine-Grain Parallel Processors, 1990. Grant proposal
prepared at University of Wisconsin-Madison.
[SS95] J. E. Smith and G. S. Sohi. The Microarchitecture of Superscalar Processors. Pro-
ceedings of the IEEE, December 1995.
[S+87] J. E. Smith et al. The ZS-1 Central Processor. In Proceedings of the 2nd International
Conference on Architectural Support for Programming Languages and Operating
Systems, pages 199-204, October 1987.
[S+91] H. Shinohara et al. A Flexible Multiport RAM Compiler for Data Path. IEEE Journal
of Solid-State Circuits, 26(3), March 1991.
[S+93] M. Suzuki et al. A 1.4ns 32b CMOS ALU in Double Pass-Transistor Logic. IEEE
Journal of Solid-State Circuits, 28(11), November 1993.
[TF70] G. S. Tjaden and M. J. Flynn. Detection and Parallel Execution of Independent
Instructions. IEEE Transactions on Computers, C-19:889-895, October 1970.
[Tho61] J. E. Thornton. Parallel Operation in the Control Data 6600. In Proceedings of the
Fall Joint Computers Conference, volume 26, pages 33-40, 1961.
[Tho63] J. E. Thornton. Considerations in Computer Design Leading up to the CONTROL
DATA 6600, 1963. Control Data Chippewa Laboratory Report.
[Tom67] R. M. Tomasulo. An Efficient Algorithm for Exploiting Multiple Arithmetic
Units. IBM Journal of Research and Development, 11:25-33, January 1967.
[T+96] D. M. Tullsen et al. Exploiting Choice: Instruction Fetch and Issue on an Implementable
Simultaneous Multithreading Processor. In Proceedings of the 23rd
Annual International Symposium on Computer Architecture, pages 191-202, May
1996.
[Unk79] Unknown. CRAY-2 Central Processor, 1979. Unpublished Cray Research Report.
[VM97] S. Vajapeyam and T. Mitra. Improving Superscalar Instruction Dispatch and Issue
by Exploiting Dynamic Code Sequences. In Proceedings of the 24th Annual International
Symposium on Computer Architecture, pages 1-12, June 1997.
[V+96] N. Vasseghi et al. 200 MHz Superscalar RISC Processor Circuit Design Issues. In
1996 IEEE International Solid-State Circuits Conference Digest of Technical Papers,
pages 356-357, February 1996.
[WE93] N. Weste and K. Eshraghian. Principles of CMOS VLSI Design. Addison Wesley,
second edition, 1993.
[Wei84] Mark Weiser. Program Slicing. IEEE Transactions on Software Engineering,
10(4):352-357, July 1984.
[Wil51] M. V. Wilkes. The Best Way to Design an Automatic Calculating Machine. In Proceedings
of the Manchester University Computer Inaugural Conference, pages 16-18,
July 1951.
[Wil95] N. C. Wilhelm. Why Wire Delays Will No Longer Scale for VLSI Chips. Techni-
cal Report SMLI TR-95-44, Sun Microsystems Laboratories, August 1995.
[WJ94] S. J. E. Wilton and N. P. Jouppi. An Enhanced Access and Cycle Time Model for
On-Chip Caches. Technical Report 93/5, DEC Western Research Laboratory, July
1994.
[WO95] K. M. Wilson and K. Olukotun. High Performance Cache Architectures to Sup-
port Dynamic Superscalar Microprocessors. Technical Report CSL-TR-95-682,
Stanford University, June 1995.
[WRP92] T. Wada, S. Rajan, and S. A. Przybylski. An Analytical Access Time Model for
On-Chip Cache Memories. IEEE Journal of Solid-State Circuits, 27(8):1147-1156,
August 1992.
[Yea96] K. C. Yeager. Mips R10000 Superscalar Microprocessor. In IEEE Micro, April
1996.
[Yea97] K. C. Yeager, October 1997. Personal Communication.
[YP92] T. Y. Yeh and Y. N. Patt. Alternate Implementations of Two-Level Adaptive Train-
ing Branch Prediction. In Proceedings of the 19th Annual International Symposium
on Computer Architecture, pages 124-134, May 1992.
Appendix A
A.1 Technology Parameters
The Hspice Level 3 models used to simulate the synthetic 0.8µm, 0.35µm, and 0.18µm
CMOS technologies are given in Table A.1.
Parameter   0.8µm               0.35µm              0.18µm
tox         165                 70                  35
vto         0.77(-0.87)         0.67(-0.77)         0.55(-0.55)
uo          570(145)            535(122)            450(80)
gamma       0.8(0.73)           0.53(0.42)          0.40(0.32)
vmax        2.7e5(0.0)          1.8e5(0.0)          1.05e5(0.0)
theta       0.404(0.233)        0.404(0.233)        0.404(0.233)
eta         0.04(0.028)         0.024(0.018)        0.008(0.008)
kappa       1.2(0.04)           1.2(0.04)           1.2(0.04)
phi         0.90                0.90                0.90
nsub        8.8e16(9.0e16)      1.38e17(1.38e17)    4.07e17(4.07e17)
nfs         4e11                4e11                4e11
xj          0.2                 0.2                 0.2
cj          2e-4(5e-4)          5.4e-4(9.3e-4)      10.6e-4(21.3e-4)
mj          0.389(0.420)        0.389(0.420)        0.389(0.420)
cjsw        4e-10               1.5e-10             3.0e-11
mjsw        0.26(0.31)          0.26(0.31)          0.26(0.31)
pb          0.80                0.80                0.80
cgso        2.1e-10(2.7e-10)    1.8e-10(2.4e-10)    1.8e-10(2.4e-10)
cgdo        2.1e-10(2.7e-10)    1.8e-10(2.4e-10)    1.8e-10(2.4e-10)
delta       0.0                 0.0                 0.0
ld          0.0001              0.0001              0.0001
rsh         0.5                 0.5                 0.5
Vdd         5.0                 2.5                 2.0
Table A.1: Spice parameters.
Table A.2 gives the metal resistance and capacitance values assumed for the three
technologies.
Technology   R_metal (Ω/µm)   C_metal (fF/µm)
0.8µm        0.02             0.275
0.35µm       0.046            0.628
0.18µm       0.09             1.22
Table A.2: Metal resistance and capacitance.
A.2 Delay Results
Issue Width   Decoder Delay (ps)   Wordline Drive Delay (ps)   Bitline Delay (ps)   Total Delay (ps)
0.8µm technology
2             540.3                218.9                       498.2                1502.2
4             547.1                227.9                       529.6                1566.9
8             562.5                245.8                       594.2                1700.9
0.35µm technology
2             220.2                95.6                        236.5                649.4
4             225.8                103.9                       259.2                698.5
8             243.1                115.8                       303.1                800.8
0.18µm technology
2             129.6                70.6                        175.7                435.4
4             136.8                78.2                        193.4                478.9
8             148.4                92.5                        228.5                561.7
Table A.3: Breakdown of rename delay.
Window Size   Tag Drive Delay (ps)   Tag Match Delay (ps)   Match OR Delay (ps)   Total Delay (ps)
Issue Width = 2
8             73.0                   331.3                  248.1                 652.4
16            82.6                   333.1                  248.5                 664.2
24            92.6                   337.3                  248.8                 678.7
32            103.7                  344.0                  249.1                 696.9
40            114.9                  347.7                  248.9                 711.5
48            126.3                  352.4                  248.7                 727.5
56            137.4                  358.7                  249.2                 745.4
64            149.1                  364.6                  248.7                 762.4
Issue Width = 4
8             74.5                   368.2                  407.0                 849.7
16            86.4                   372.4                  406.8                 865.6
24            98.8                   377.6                  403.9                 880.3
32            112.3                  384.8                  409.2                 906.2
40            126.2                  392.3                  408.7                 927.2
48            140.6                  400.1                  404.2                 944.9
56            156.3                  409.0                  404.1                 969.4
64            172.4                  416.9                  403.3                 992.7
Issue Width = 8
8             77.5                   400.2                  665.3                 1143.0
16            93.3                   406.6                  665.7                 1165.5
24            111.4                  415.2                  664.8                 1191.4
32            130.7                  425.2                  658.5                 1214.4
40            151.5                  437.7                  660.2                 1249.5
48            174.4                  451.0                  658.3                 1283.8
56            199.3                  465.0                  664.6                 1328.9
64            228.2                  479.2                  664.6                 1372.0
Table A.4: Breakdown of window wakeup delay for 0.8µm technology.
Window Size   Tag Drive Delay (ps)   Tag Match Delay (ps)   Match OR Delay (ps)   Total Delay (ps)
Issue Width = 2
8             28.5                   126.1                  101.3                 255.8
16            33.4                   128.7                  101.5                 263.7
24            38.3                   129.1                  101.2                 268.6
32            43.7                   133.2                  97.3                  274.1
40            49.7                   136.3                  101.2                 287.3
48            53.1                   138.8                  97.4                  289.3
56            58.9                   142.7                  101.1                 302.8
64            64.4                   145.0                  98.9                  308.3
Issue Width = 4
8             29.7                   147.1                  155.8                 332.6
16            36.0                   151.2                  158.3                 345.4
24            42.7                   155.0                  159.1                 356.8
32            50.5                   157.7                  158.4                 366.7
40            56.3                   163.2                  159.0                 378.5
48            63.2                   168.1                  159.6                 390.9
56            72.0                   171.9                  157.0                 400.9
64            80.9                   179.0                  159.1                 419.0
Issue Width = 8
8             32.2                   173.4                  257.6                 463.2
16            41.6                   177.5                  257.8                 476.9
24            51.1                   183.7                  257.8                 492.5
32            61.9                   190.6                  257.7                 510.1
40            74.7                   199.1                  257.7                 531.5
48            88.8                   208.9                  257.6                 555.3
56            102.9                  216.4                  258.4                 577.7
64            121.8                  224.8                  258.4                 605.0
Table A.5: Breakdown of window wakeup delay for 0.35µm technology.
Window Size   Tag Drive Delay (ps)   Tag Match Delay (ps)   Match OR Delay (ps)   Total Delay (ps)
Issue Width = 2
8             14.6                   67.9                   60.7                  143.1
16            18.8                   68.7                   60.6                  148.1
24            22.4                   69.8                   60.6                  152.7
32            26.1                   71.8                   60.6                  152.7
40            29.9                   73.6                   60.3                  163.8
48            33.7                   75.7                   59.9                  169.3
56            36.6                   77.3                   61.0                  174.8
64            41.4                   79.4                   59.7                  180.5
Issue Width = 4
8             15.8                   84.1                   84.7                  184.7
16            21.1                   85.1                   84.4                  190.6
24            26.1                   87.6                   84.8                  198.5
32            31.2                   90.8                   84.3                  206.3
40            36.6                   93.3                   84.8                  214.7
48            41.7                   96.5                   84.4                  222.5
56            47.5                   99.4                   84.8                  231.8
64            54.1                   102.8                  84.4                  241.3
Issue Width = 8
8             18.8                   104.9                  123.6                 247.3
16            26.1                   108.4                  123.8                 258.3
24            33.8                   113.6                  123.1                 270.5
32            42.0                   118.2                  125.0                 285.1
40            51.5                   124.8                  123.2                 299.5
48            62.6                   130.4                  123.0                 316.0
56            75.1                   135.2                  123.2                 333.4
64            90.0                   139.4                  122.9                 352.3
Table A.6: Breakdown of window wakeup delay for 0.18µm technology.
Window Size   T_reqpropd (ps)   T_root (ps)   T_grantpropd (ps)   Total Delay (ps)
0.8µm technology
16            233.2             607.2         272.5               1113.0
32            532.5             737.6         727.4               1997.5
64            534.6             742.9         719.8               1997.4
128           802.8             753.4         1118.5              2674.6
0.35µm technology
16            125.0             338.5         135.4               598.9
32            246.6             339.7         295.4               881.7
64            245.5             338.0         296.3               879.8
128           347.9             338.5         460.3               1146.7
0.18µm technology
16            53.6              141.7         55.1                250.4
32            107.0             141.2         123.5               371.7
64            106.9             144.2         121.9               373.0
128           159.9             146.7         195.5               502.1
Table A.7: Breakdown of selection delay.
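In Table A.7 the 32-entry and 64-entry rows are nearly identical, while the delay steps up between 16 and 32 entries and again between 64 and 128. One possible reading, offered here as an assumption rather than something the table states, is that the selection logic forms a tree with a fan-in of four, so its depth grows as the ceiling of log4(WINSIZE). The Python sketch below counts those levels; the function name `tree_levels` and the `fan_in` parameter are hypothetical.

```python
def tree_levels(window_size, fan_in=4):
    """Number of levels in a tree with the given fan-in that covers window_size leaves.

    Assumption for illustration: one level per factor of fan_in, i.e.
    ceil(log_fan_in(window_size)), computed with integers to avoid
    floating-point rounding in the logarithm.
    """
    levels, capacity = 0, 1
    while capacity < window_size:
        capacity *= fan_in
        levels += 1
    return levels


# Window sizes that appear in Table A.7.
for size in (16, 32, 64, 128):
    print(size, tree_levels(size))
```

For the four window sizes in the table this gives 2, 3, 3, and 4 levels, which lines up with the grouping of the tabulated delays.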
Issue Width   Window Size   Register File Size   Rename Delay (ps)   Window Delay (ps)   Register File Delay (ps)   Data Bypass Delay (ps)
2             16            48                   1374.57             1777.20             1902.05                    233.15
4             32            80                   1417.25             2903.70             2222.10                    411.12
8             64            120                  1489.91             3369.4              2715.71                    836.79
Table A.8: Overall delay results for 0.8µm technology.
Issue Width   Window Size   Register File Size   Rename Delay (ps)   Window Delay (ps)   Register File Delay (ps)   Data Bypass Delay (ps)
2             16            48                   524.76              862.60              724.43                     110.45
4             32            80                   554.08              1248.40             873.21                     223.79
8             64            120                  603.59              1484.80             1155.45                    486.50
Table A.9: Overall delay results for 0.35µm technology.
Issue Width   Window Size   Register File Size   Rename Delay (ps)   Window Delay (ps)   Register File Delay (ps)   Data Bypass Delay (ps)
2             16            48                   285.43              398.50              393.43                     91.00
4             32            80                   311.55              578.00              498.29                     177.58
8             64            120                  355.62              725.30              729.40                     421.42
Table A.10: Overall delay results for 0.18µm technology.
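To compare the four structures across the nine design points in Tables A.8 to A.10, the following Python sketch picks the largest tabulated delay in each row. The dictionary layout and the name `slowest_structure` are hypothetical; the numbers are copied from the tables. The sketch only ranks the tabulated values. Whether the largest one actually limits the clock depends on how the structures are pipelined, which these tables do not capture.

```python
# Delays in ps copied from Tables A.8-A.10; keys are (technology, issue width).
STRUCTURES = ("rename", "window", "register file", "bypass")
DELAYS = {
    ("0.8um", 2): (1374.57, 1777.20, 1902.05, 233.15),
    ("0.8um", 4): (1417.25, 2903.70, 2222.10, 411.12),
    ("0.8um", 8): (1489.91, 3369.4, 2715.71, 836.79),
    ("0.35um", 2): (524.76, 862.60, 724.43, 110.45),
    ("0.35um", 4): (554.08, 1248.40, 873.21, 223.79),
    ("0.35um", 8): (603.59, 1484.80, 1155.45, 486.50),
    ("0.18um", 2): (285.43, 398.50, 393.43, 91.00),
    ("0.18um", 4): (311.55, 578.00, 498.29, 177.58),
    ("0.18um", 8): (355.62, 725.30, 729.40, 421.42),
}


def slowest_structure(tech, issue_width):
    """Return (name, delay_ps) of the largest tabulated delay for one design point."""
    row = DELAYS[(tech, issue_width)]
    index = max(range(len(row)), key=row.__getitem__)
    return STRUCTURES[index], row[index]


for key in DELAYS:
    print(key, slowest_structure(*key))
```

With the tabulated values, the window logic has the largest delay in seven of the nine design points; the register file is larger for the 2-way 0.8µm and 8-way 0.18µm points.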
Appendix B
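The constants in the delay equations presented in Chapter 2 are tabulated below. In the
equations, IW is the issue width, WINSIZE is the instruction window size, and NPREG is the
number of physical registers. Where a table lists a second value in parentheses, it is the
constant's value relative to c0 for the same feature size. The relative values are
presented to show how each component's contribution varies with feature size.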
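As a quick numerical check of how the relative values can be read, the Python sketch below divides each constant by c0 for the same feature size. That the parenthesized entries are exactly this ratio is an inference from the tabulated numbers (for example, 8.611 / 387.16 is about 2.22e-2 in Table B.1); the function name `relative_constants` is hypothetical.

```python
def relative_constants(constants):
    """Divide every constant by the first one (c0), giving dimensionless ratios."""
    c0 = constants[0]
    return [c / c0 for c in constants]


# Absolute constants (ps) from the 0.8 um row of Table B.1 (rename decoder delay).
decoder_08 = [387.16, 8.611, 1.07e-2]

for i, (c, r) in enumerate(zip(decoder_08, relative_constants(decoder_08))):
    print(f"c{i} = {c:g} ps, relative to c0 = {r:.2e}")
```

The printed ratios can be compared directly with the parenthesized entries in the corresponding table row.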
B.1 Register Rename Logic
Decoder delay
$$T_{\mathrm{decoder}} = c_0 + c_1\,\mathit{IW} + c_2\,\mathit{IW}^2$$
Feature Size   c0 (ps)          c1 (ps)           c2 (ps)
0.8µm          387.16 (1.00)    8.611 (2.22e-2)   1.07e-2 (2.76e-5)
0.35µm         153.66 (1.00)    5.425 (3.53e-2)   1.07e-2 (6.96e-5)
0.18µm         81.88 (1.00)     3.96 (4.84e-2)    1.07e-2 (1.31e-4)
Table B.1: Constants in decoder delay equation for rename logic.
Wordline delay
$$T_{\mathrm{wordline}} = c_0 + c_1\,\mathit{IW} + c_2\,\mathit{IW}^2$$
Feature Size   c0 (ps)          c1 (ps)           c2 (ps)
0.8µm          98.71 (1.00)     7.17 (7.26e-2)    1.93e-3 (1.96e-5)
0.35µm         39.18 (1.00)     4.52 (1.15e-1)    1.93e-3 (4.92e-5)
0.18µm         20.88 (1.00)     3.30 (1.58e-1)    1.93e-3 (9.24e-5)
Table B.2: Constants in wordline delay equation for rename logic.
Bitline delay
$$T_{\mathrm{bitline}} = c_0 + c_1\,\mathit{IW} + c_2\,\mathit{IW}^2$$
Feature Size   c0 (ps)          c1 (ps)           c2 (ps)
0.8µm          525.75 (1.00)    22.06 (4.19e-2)   5.84e-3 (1.11e-5)
0.35µm         208.67 (1.00)    13.90 (6.67e-2)   5.84e-3 (2.80e-5)
0.18µm         111.20 (1.00)    10.14 (9.12e-2)   5.84e-3 (5.25e-5)
Table B.3: Constants in bitline delay equation for rename logic.
Total delay
$$T_{\mathrm{rename}} = c_0 + c_1\,\mathit{IW} + c_2\,\mathit{IW}^2$$
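The total rename delay is a quadratic in the issue width, T_rename = c0 + c1·IW + c2·IW². A minimal Python sketch of evaluating it with the constants of Table B.4 follows; the names `RENAME_CONSTANTS` and `rename_delay` are hypothetical. The results are values of the fitted expression only and are not checked here against the Spice results in Appendix A.

```python
# (c0, c1, c2) in ps, copied from Table B.4 (total rename delay).
RENAME_CONSTANTS = {
    "0.8um": (1011.62, 37.84, 1.78e-2),
    "0.35um": (401.51, 23.84, 1.78e-2),
    "0.18um": (213.96, 17.40, 1.78e-2),
}


def rename_delay(tech, issue_width):
    """Evaluate T_rename = c0 + c1*IW + c2*IW^2 (ps) for one feature size."""
    c0, c1, c2 = RENAME_CONSTANTS[tech]
    return c0 + c1 * issue_width + c2 * issue_width ** 2


for tech in RENAME_CONSTANTS:
    print(tech, [round(rename_delay(tech, iw), 1) for iw in (2, 4, 8)])
```

Because c2 is small compared with c1 in every row of Table B.4, the expression is close to linear in IW over the 2-way to 8-way range.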
Feature Size   c0 (ps)           c1 (ps)           c2 (ps)
0.8µm          1011.62 (1.00)    37.84 (3.74e-2)   1.78e-2 (1.76e-5)
0.35µm         401.51 (1.00)     23.84 (5.94e-2)   1.78e-2 (4.43e-5)
0.18µm         213.96 (1.00)     17.40 (8.13e-2)   1.78e-2 (8.32e-5)
Table B.4: Constants in total delay equation for rename logic.
B.2 Window Wakeup Logic
Tag drive delay
$$T_{\mathrm{tagdrive}} = c_0 + (c_1 + c_2\,\mathit{IW})\,\mathit{WINSIZE} + (c_3 + c_4\,\mathit{IW} + c_5\,\mathit{IW}^2)\,\mathit{WINSIZE}^2$$
Feature Size   c0 (ps)         c1 (ps)             c2 (ps)             c3 (ps)             c4 (ps)             c5 (ps)
0.8µm          18.14 (1.00)    6.37e-1 (3.51e-2)   9.43e-2 (5.20e-3)   3.05e-3 (1.68e-4)   1.52e-3 (8.38e-5)   1.21e-4 (6.67e-6)
0.35µm         7.20 (1.00)     2.97e-1 (4.12e-2)   5.94e-2 (8.25e-3)   2.10e-3 (2.92e-4)   1.29e-3 (1.79e-4)   1.21e-4 (1.68e-5)
0.18µm         3.84 (1.00)     1.82e-1 (4.74e-2)   4.34e-2 (1.13e-2)   1.66e-3 (4.32e-4)   1.08e-3 (2.81e-4)   1.21e-4 (3.15e-5)
Table B.5: Constants in tag drive delay equation for wakeup logic.
Tag match delay
$$T_{\mathrm{tagmatch}} = c_0 + c_1\,\mathit{IW} + c_2\,\mathit{IW}^2$$
Feature Size   c0 (ps)          c1 (ps)          c2 (ps)
0.8µm          390.68 (1.00)    6.01 (1.54e-2)   3.35e-3 (8.57e-6)
0.35µm         83.15 (1.00)     3.48 (4.18e-2)   3.35e-3 (4.03e-5)
0.18µm         45.46 (1.00)     2.55 (5.61e-2)   3.35e-3 (7.37e-5)
Table B.6: Constants in tag match delay equation for wakeup logic.
Match OR delay
$$T_{\mathrm{matchOR}} = c_0 + c_1\,\mathit{IW}$$
Feature Size   c0 (ps)   c1 (ps)
0.8µm          60.00     70.00
0.35µm         26.25     30.62
0.18µm         13.63     15.75
Table B.7: Constants in match OR delay equation for wakeup logic.
Total delay
$$T_{\mathrm{wakeup}} = (c_0 + c_1\,\mathit{IW} + c_2\,\mathit{IW}^2) + (c_3 + c_4\,\mathit{IW})\,\mathit{WINSIZE} + (c_5 + c_6\,\mathit{IW} + c_7\,\mathit{IW}^2)\,\mathit{WINSIZE}^2$$
Feature Size   c0 (ps)          c1 (ps)           c2 (ps)             c3 (ps)             c4 (ps)             c5 (ps)
0.8µm          468.82 (1.00)    76.01 (1.62e-1)   3.35e-3 (7.14e-6)   6.37e-3 (1.36e-5)   9.43e-2 (2.01e-4)   3.05e-3 (6.50e-6)
0.35µm         116.60 (1.00)    34.10 (2.91e-1)   3.35e-3 (2.87e-5)   2.97e-1 (2.55e-3)   5.94e-2 (5.09e-4)   2.10e-3 (1.80e-5)
0.18µm         62.93 (1.00)     18.30 (2.91e-1)   3.35e-3 (5.32e-5)   1.82e-1 (2.89e-3)   4.34e-2 (6.90e-4)   1.66e-3 (2.64e-5)
Feature Size   c6 (ps)             c7 (ps)
0.8µm          1.52e-3 (3.24e-6)   1.21e-4 (2.58e-7)
0.35µm         1.29e-3 (1.11e-5)   1.21e-4 (1.04e-6)
0.18µm         1.07e-3 (1.70e-5)   1.21e-4 (1.92e-6)
Table B.8: Constants in total delay equation for wakeup logic.
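The total wakeup delay groups its eight constants into a part that depends only on IW, a part linear in WINSIZE, and a part quadratic in WINSIZE: T_wakeup = (c0 + c1·IW + c2·IW²) + (c3 + c4·IW)·WINSIZE + (c5 + c6·IW + c7·IW²)·WINSIZE². The Python sketch below evaluates the three parts separately using the 0.18µm row of Table B.8; the names `WAKEUP_018` and `wakeup_terms` are hypothetical, and the output is the value of the fitted expression rather than a reproduction of the Spice results in Appendix A.

```python
# (c0 .. c7) in ps, copied from the 0.18 um row of Table B.8.
WAKEUP_018 = (62.93, 18.30, 3.35e-3, 1.82e-1, 4.34e-2, 1.66e-3, 1.07e-3, 1.21e-4)


def wakeup_terms(constants, issue_width, window_size):
    """Return the (constant, linear, quadratic) parts of T_wakeup in ps.

    "Constant", "linear" and "quadratic" refer to the dependence on WINSIZE.
    """
    c0, c1, c2, c3, c4, c5, c6, c7 = constants
    iw, ws = issue_width, window_size
    constant_part = c0 + c1 * iw + c2 * iw ** 2
    linear_part = (c3 + c4 * iw) * ws
    quadratic_part = (c5 + c6 * iw + c7 * iw ** 2) * ws ** 2
    return constant_part, linear_part, quadratic_part


for ws in (16, 32, 64):
    parts = wakeup_terms(WAKEUP_018, issue_width=4, window_size=ws)
    print(ws, [round(p, 2) for p in parts], round(sum(parts), 2))
```

Keeping the three parts separate makes it easy to see how much of the total comes from the WINSIZE² term as the window grows.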
B.3 Window Selection Logic
$$T_{\mathrm{select}} = c_0 + c_1 \log_4 \mathit{WINSIZE}$$
Feature Size   c0 (ps)   c1 (ps)
0.8µm          127.61    322.51
0.35µm         50.65     128.00
0.18µm         26.99     68.21
Table B.9: Constants in total delay equation for selection logic.
B.4 Register File Logic
Decoder delay
$$T_{\mathrm{decoder}} = c_0 + (c_1 + c_2\,\mathit{IW})\,\mathit{NPREG} + (c_3 + c_4\,\mathit{IW} + c_5\,\mathit{IW}^2)\,\mathit{NPREG}^2$$
Feature Size   c0 (ps)          c1 (ps)             c2 (ps)             c3 (ps)             c4 (ps)             c5 (ps)
0.8µm          414.62 (1.00)    9.63e-2 (2.32e-4)   2.03e-2 (4.89e-5)   1.94e-6 (4.68e-9)   4.37e-6 (1.05e-8)   2.46e-6 (5.93e-9)
0.35µm         164.56 (1.00)    6.06e-2 (3.68e-4)   2.03e-2 (1.23e-4)   1.94e-6 (1.18e-8)   4.37e-6 (2.65e-8)   2.46e-6 (1.49e-8)
0.18µm         87.69 (1.00)     4.43e-2 (5.05e-4)   2.03e-2 (2.31e-4)   1.94e-6 (2.21e-8)   4.37e-6 (4.98e-8)   2.46e-6 (2.80e-8)
Table B.10: Constants in decoder delay equation for register file logic.
Wordline delay
$$T_{\mathrm{wordline}} = c_0 + c_1\,\mathit{IW} + c_2\,\mathit{IW}^2$$
Feature Size   c0 (ps)          c1 (ps)           c2 (ps)
0.8µm          203.92 (1.00)    49.66 (2.43e-1)   1.61e-1 (7.90e-4)
0.35µm         80.94 (1.00)     31.29 (3.86e-1)   1.61e-1 (1.99e-3)
0.18µm         43.13 (1.00)     22.84 (5.30e-1)   1.61e-1 (3.73e-3)
Table B.11: Constants in wordline delay equation for register file logic.
Bitline delay
$$T_{\mathrm{bitline}} = c_0 + (c_1 + c_2\,\mathit{IW})\,\mathit{NPREG} + (c_3 + c_4\,\mathit{IW} + c_5\,\mathit{IW}^2)\,\mathit{NPREG}^2$$
Feature Size   c0 (ps)          c1 (ps)             c2 (ps)             c3 (ps)             c4 (ps)             c5 (ps)
0.8µm          300.00 (1.00)    1.02 (3.40e-3)      2.54e-1 (8.47e-4)   1.02e-5 (3.40e-8)   1.40e-5 (4.67e-8)   2.85e-6 (9.50e-9)
0.35µm         119.07 (1.00)    4.05e-1 (3.40e-3)   1.60e-1 (1.34e-3)   6.40e-6 (5.37e-8)   8.80e-6 (7.39e-8)   2.85e-6 (2.39e-8)
0.18µm         63.45 (1.00)     2.96e-1 (4.66e-3)   1.17e-1 (1.84e-3)   4.68e-6 (7.38e-8)   6.42e-6 (1.01e-7)   2.85e-6 (4.49e-8)
Table B.12: Constants in bitline delay equation for register file logic.
Total delay
$$T_{\mathrm{regfile}} = (c_0 + c_1\,\mathit{IW} + c_2\,\mathit{IW}^2) + (c_3 + c_4\,\mathit{IW})\,\mathit{NPREG} + (c_5 + c_6\,\mathit{IW} + c_7\,\mathit{IW}^2)\,\mathit{NPREG}^2$$
Feature Size   c0 (ps)          c1 (ps)           c2 (ps)             c3 (ps)             c4 (ps)             c5 (ps)
0.8µm          918.53 (1.00)    49.66 (5.41e-2)   1.61e-1 (1.75e-4)   1.12 (1.22e-3)      2.74e-1 (2.98e-4)   1.21e-5 (1.32e-8)
0.35µm         364.57 (1.00)    31.29 (8.58e-2)   1.61e-1 (4.42e-4)   4.65e-1 (1.28e-3)   1.80e-1 (4.94e-4)   8.34e-6 (2.29e-8)
0.18µm         194.27 (1.00)    22.84 (1.18e-1)   1.61e-1 (8.29e-4)   3.40e-1 (1.75e-3)   1.37e-1 (7.05e-4)   6.62e-6 (3.41e-8)
Feature Size   c6 (ps)             c7 (ps)
0.8µm          1.84e-5 (2.00e-8)   5.31e-6 (5.78e-9)
0.35µm         1.32e-5 (3.62e-8)   5.31e-6 (1.46e-8)
0.18µm         1.08e-5 (5.56e-8)   5.31e-6 (2.73e-8)
Table B.13: Constants in total delay equation for register file logic.
B.5 Data Bypass Logic
$$T_{\mathrm{bypass}} = c_0 + c_1\,\mathit{IW} + c_2\,\mathit{IW}^2$$
Feature Size   c0 (ps)         c1 (ps)        c2 (ps)
0.8µm          18.13 (1.00)    25.50 (1.41)   6.15 (0.34)
0.35µm         7.20 (1.00)     16.06 (2.23)   6.15 (0.85)
0.18µm         3.84 (1.00)     11.72 (3.06)   6.15 (1.60)
Table B.14: Constants in total delay equation for data bypass logic.
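In Table B.14 the constant c2 has the same absolute value (6.15 ps) at all three feature sizes, while c0 and c1 shrink. The Python sketch below evaluates T_bypass = c0 + c1·IW + c2·IW² and reports what fraction of the result comes from the IW² term. The names `BYPASS_CONSTANTS`, `bypass_delay`, and `quadratic_share` are hypothetical, and the numbers are outputs of the fitted expression, not the simulated bypass delays of Appendix A.

```python
# (c0, c1, c2) in ps, copied from Table B.14 (data bypass delay).
BYPASS_CONSTANTS = {
    "0.8um": (18.13, 25.50, 6.15),
    "0.35um": (7.20, 16.06, 6.15),
    "0.18um": (3.84, 11.72, 6.15),
}


def bypass_delay(tech, issue_width):
    """Evaluate T_bypass = c0 + c1*IW + c2*IW^2 (ps) for one feature size."""
    c0, c1, c2 = BYPASS_CONSTANTS[tech]
    return c0 + c1 * issue_width + c2 * issue_width ** 2


def quadratic_share(tech, issue_width):
    """Fraction of T_bypass contributed by the c2*IW^2 term."""
    c2 = BYPASS_CONSTANTS[tech][2]
    return c2 * issue_width ** 2 / bypass_delay(tech, issue_width)


for tech in BYPASS_CONSTANTS:
    print(tech, round(bypass_delay(tech, 8), 1), round(quadratic_share(tech, 8), 2))
```

For an 8-way machine, the share of the IW² term in the fitted expression rises from about 0.64 at 0.8µm to about 0.80 at 0.18µm, since that term does not shrink with feature size while the others do.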