
See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/228396491

An Overview of Static Pipelining

Article in IEEE Computer Architecture Letters · January 2012
DOI: 10.1109/L-CA.2011.26

4 authors, including Ian Finlayson (University of Mary Washington), Gary S. Tyson (Florida State University) and Gang-Ryung Uh (Boise State University).

All content following this page was uploaded by Gary S. Tyson on 03 June 2014.


An Overview of Static Pipelining
Ian Finlayson†, Gang-Ryung Uh‡, David Whalley† and Gary Tyson†
†Department of Computer Science, Florida State University ({finlayso, whalley, [email protected])
‡Department of Computer Science, Boise State University ([email protected])

Abstract—A new generation of mobile applications requires reduced energy consumption without sacrificing execution performance. In this paper, we propose to respond to these conflicting demands with an innovative statically pipelined processor supported by an optimizing compiler. The central idea of the approach is that the control during each cycle for each portion of the processor is explicitly represented in each instruction. Thus the pipelining is in effect statically determined by the compiler. The benefits of this approach include simpler hardware and that it allows the compiler to perform optimizations that are not possible on traditional architectures. The initial results indicate that static pipelining can significantly reduce power consumption without adversely affecting performance.

¹Manuscript submitted: 11-Jun-2011. Manuscript accepted: 07-Jul-2011. Final manuscript received: 14-Jul-2011.

I. INTRODUCTION

With the prevalence of embedded systems, energy consumption has become an important design constraint. As these embedded systems become more sophisticated, however, they also require a greater degree of performance. One of the most widely used techniques for increasing processor performance is instruction pipelining, which allows for increased clock frequency by reducing the amount of work that needs to be performed for an instruction in each clock cycle. The way pipelining is traditionally implemented, however, results in several areas of inefficiency with respect to energy consumption, such as unnecessary register file accesses, checking for forwarding and hazards when they cannot occur, latching unused values between pipeline stages, and repeatedly calculating invariant values such as branch target addresses.

In this paper, we present an overview of a technique called static pipelining [6], which aims to provide the performance benefits of pipelining in a more energy-efficient manner. With static pipelining, the control for each portion of the processor is explicitly represented in each instruction. Instead of pipelining instructions dynamically in hardware, it is done statically by the optimizing compiler. There are several benefits to this approach. First, energy consumption is reduced by avoiding unnecessary actions found in traditional pipelines. Second, static pipelining gives more control to the compiler, which allows for more fine-grained optimizations for both performance and power. Lastly, a statically pipelined processor has simpler hardware than a traditional processor, which can potentially provide a lower production cost.

This paper is structured as follows: Section 2 introduces a static pipelining architecture. Section 3 discusses compiling and optimizing statically pipelined code. Section 4 gives preliminary results. Section 5 reviews related work. Section 6 discusses future work. Lastly, Section 7 draws conclusions.

II. STATICALLY PIPELINED ARCHITECTURE

Instruction pipelining is commonly used to improve processor performance; however, it also introduces some inefficiencies. First is the need to latch all control signals and data values between pipeline stages, even when this information is not needed. Pipelining also introduces branch and data hazards. Branch hazards result in either stalls for every branch, or the need for branch predictors and delays when branches are mis-predicted. Data hazards result in the need for forwarding logic, which leads to unnecessary register file accesses. Experiments with SimpleScalar [2] running the MiBench benchmark suite [8] indicate that 27.9% of register reads are unnecessary because the values will be replaced from forwarding. Additionally, 11.1% of register writes are not needed because their only consumers get the values from forwarding instead. Because register file energy consumption is a significant portion of processor energy, these unnecessary accesses are quite wasteful [11] [9]. Additional inefficiencies found in traditional pipelines include repeatedly calculating branch targets when they do not change, reading registers whether or not they are used for the given type of instruction, and adding an offset to a register to form a memory address even when that offset is zero. The goal of static pipelining is to avoid such inefficiencies while not sacrificing the performance gains associated with pipelining.

Figure 1 illustrates the basic idea of our approach. With traditional pipelining, instructions spend several cycles in the pipeline. For example, the sub instruction in Figure 1(b) requires one cycle for each stage and remains in the pipeline from cycles four through seven. Each instruction is fetched and decoded, and information about the instruction flows through the pipeline, via the pipeline registers, to control each portion of the processor that will take a specific action during each cycle. Figure 1(c) illustrates how a statically pipelined processor operates. Data still passes through the processor in multiple cycles, but how each portion of the processor is controlled during each cycle is explicitly represented in each instruction. Thus instructions are encoded to cause simultaneous actions to be performed that are normally associated with separate pipeline stages. For example, at cycle 5, all portions of the processor are controlled by a single instruction (depicted with the shaded box) that was fetched the previous cycle. In effect, the pipelining is determined statically by the compiler as opposed to dynamically by the hardware.

Figure 2 depicts one possible datapath of a statically pipelined processor. The fetch portion of the processor is essentially unchanged from the conventional processor. Instructions are still fetched from the instruction cache and branches are predicted by a branch predictor. The rest of the processor, however, is quite different. Because statically pipelined processors do not need to break instructions into multiple stages, there is no need for pipeline registers. In their place are a number of internal registers. Unlike pipeline registers, these internal registers are explicitly read and written by the instructions, and can hold their values across multiple cycles.

There are ten internal registers. The RS1 and RS2 registers are used to hold values read from the register file. The LV register is used to hold values loaded from the data cache. The SEQ register is used to hold the address of the next sequential instruction at the time it is written, which is used to store the target of a branch in order to avoid calculating the target address. The SE register is used to hold a sign-extended immediate value. The ALUR and TARG registers are used to hold values calculated in the ALU. The FPUR register is used to hold results calculated in the FPU, which is used for multi-
[Figure 1 shows the instruction sequence or; add R2,#1,R3; sub R2,R3,R4; and R5,#7,R3; xor progressing through the IF, RF, EX, MEM and WB stages over clock cycles 1–9, under (a) traditional instructions, (b) traditional pipelining and (c) static pipelining.]

Fig. 1. Traditionally Pipelined vs. Statically Pipelined Instructions

[Figure 2 shows the datapath diagram.]

Fig. 2. Possible Datapath of a Statically Pipelined Processor

cycle operations. If the PC is used as an input to the ALU (as in a PC-relative address computation), then the result is placed in the TARG register; otherwise it is placed in the ALUR register. The CP1 and CP2 registers are used to hold values copied from one of the other internal registers. These copy registers are used to hold loop-invariant values and support simple register renaming for instruction scheduling. Since these internal registers are small, and can be placed near the portions of the processor that access them, they are accessible at a lower energy cost than the register file. Because more details of the datapath are exposed at the architectural level, changes to the micro-architecture are more likely to result in the need for recompilation. However, this is less critical for embedded systems, where the software on the system is often packaged with the hardware. Because these registers are exposed at the architectural level, a new level of compiler optimizations can be exploited, as we will demonstrate in Section 3.

Each statically pipelined instruction consists of a set of effects, each of which updates some portion of the processor. The effects that are allowed in each cycle mostly correspond to what the baseline five-stage pipeline can do in one cycle: one ALU or FPU operation, one memory operation, two register reads, one register write and one sign extension. In addition, one copy can be made from an internal register to one of the two copy registers, and the next sequential instruction address can optionally be saved in the SEQ register. Lastly, the next PC can be assigned the value of one of the internal registers. If the ALU operation is a branch operation, then the next PC will only be set according to the outcome of the branch; otherwise, the branch is unconditionally taken.

To evaluate the architecture, we allow any combination of effects to be specified in any instruction, which requires 64-bit instructions. In a real implementation, only the commonly used combinations would be specifiable at a time, with a field in the instruction indicating which combination is used. Our preliminary analysis shows that it should be practical to use 32-bit instructions with minimal loss in efficiency. The reason is that, while there are nine possible effects, a typical instruction will actually use far fewer. In the rare cases where too many effects are scheduled together, the compiler will attempt to move effects into surrounding instructions while obeying structural hazards and dependencies. Only when the compiler cannot do so will an additional instruction be generated for these additional instruction effects.

A static pipeline can be viewed as a two-stage processor, with the two stages being fetch and everything after fetch. Because everything after fetch happens in parallel, the clock frequency for a static pipeline can be just as high as for a traditional pipeline. Therefore, if the number of instructions executed does not increase compared to a traditional pipeline, there will be no performance loss associated with static pipelining. Section 3 will discuss compiler optimizations that attempt to keep the number of instructions executed as low as, or lower than, that of traditional pipelines.

III. COMPILATION

A statically pipelined architecture exposes more details of the datapath to the compiler, allowing the compiler to perform optimizations that would not be possible on a conventional machine. This section gives an overview of compiling for a statically pipelined architecture with a simple running example, the source code for which can be seen in Figure 3(a). This code was compiled with the VPO [3] MIPS port, with full optimizations applied, and the main loop is shown in Figure 3(b). In this example, r[9] is used as a pointer to the current array element, r[5] is a pointer to the end of the array, and r[6] holds the value m. The requirements for each iteration of the loop are shown in Figure 3(c).

We ported the VPO compiler to the statically pipelined processor. In this section, we explain its function and show how this example can be compiled efficiently for a statically pipelined machine. The process begins by first compiling the code for the MIPS architecture with many optimizations turned on. This is done because it was found that certain optimizations, such as register allocation, were much easier to apply for the MIPS architecture than for the static pipeline.
(a) Source Code:

    for (i = 0; i < 100; i++)
        a[i] += m;

(b) MIPS Code:

    L6:
    r[3] = M[r[9]];
    r[2] = r[3] + r[6];
    M[r[9]] = r[2];
    r[9] = r[9] + 4;
    PC = r[9] != r[5], L6

(c) MIPS requirements for each array element: 5 instructions, 5 ALU ops, 8 RF reads, 3 RF writes, 1 branch calc., 2 sign extends

[Panels (d)–(g) show the statically pipelined code at successive stages: (d) the initial statically pipelined code, (e) the code after common sub-expression elimination, (f) the code after loop-invariant code motion, and (g) the code after scheduling.]

(h) Static Pipeline requirements for each array element: 3 instructions, 3 ALU operations, 1 register file read, 1 register file write, 0 branch address calculations, 0 sign extensions

Fig. 3. Example of Compiling for a Statically Pipelined Processor

VPO works with an intermediate representation called "RTLs", where each RTL corresponds to one machine instruction on the target machine. The RTLs generated by the MIPS compiler are legal for the MIPS, but not for the statically pipelined processor. The next step in compilation, then, is to break these RTLs into ones that are legal for a static pipeline. The result of this stage can be seen in Figure 3(d). The dashed lines separate effects corresponding to the different MIPS instructions in Figure 3(b).

As it stands now, the code is much less efficient than the MIPS code, taking 15 instructions in place of 5. The next step, then, is to apply traditional compiler optimizations to the initial statically pipelined code. While these optimizations have already been applied in the platform-independent optimization phase, they can provide additional benefits when applied to statically pipelined instructions. Figure 3(e) shows the result of applying common sub-expression elimination, which, in VPO, includes copy propagation and dead-assignment elimination. This optimization is able to avoid unnecessary instructions primarily by reusing values in internal registers, which is impossible with the pipeline registers of traditional machines. Because an internal register access is cheaper than a register file access, the compiler will prefer the former.

While the code generation and optimizations described so far have been implemented and are automatically performed by the compiler, the remaining optimizations discussed in this section are performed by hand, though we will automate them in the future. The first one we perform is loop-invariant code motion, an optimization that moves instructions out of a loop when doing so does not change the program behavior. Figure 3(f) shows the result of applying this transformation. As can be seen, loop-invariant code motion can be applied to statically pipelined code in ways that it cannot for traditional architectures. We are able to move out the calculation of the branch target and also the sign extension. Traditional machines are unable to break these effects out of the instructions that utilize them, so these values are repetitively calculated. Also, by taking advantage of the copy register, we are able to move the read of r[6] outside the loop as well. We are able to create a more efficient loop due to this fine-grained control of the instruction effects.

While the code in Figure 3(f) is an improvement, and has fewer register file accesses than the baseline, it still requires more instructions. In order to reduce the number of instructions in the loop, we need to schedule multiple effects together. For this example, and the benchmark used in the results section, the scheduling was done by hand. Figure 3(g) shows the loop after scheduling. The iterations of the loop are overlapped using software pipelining [4]. With the MIPS baseline, there is no need to do software pipelining on this loop because there are no long-latency operations. For a statically pipelined machine, however, it allows for a tighter main loop. We also pack together effects that can be executed in parallel, obeying data and structural dependencies. Additionally, we remove the computation of the branch target by storing it in the SEQ register before entering the loop. The pipeline requirements for the statically pipelined code are shown in Figure 3(h).

The baseline we are comparing against was already optimized MIPS code. By allowing the compiler access to the details of the pipeline, it can remove instruction effects that cannot be removed on traditional machines. This example, while somewhat trivial, does demonstrate the ways in which a compiler for a statically pipelined architecture can improve program efficiency.

IV. EVALUATION

This section presents a preliminary evaluation using benchmarks compiled with our compiler and then hand-scheduled as described in the previous section. The benchmarks used are the simple vector addition example from the previous section and the convolution benchmark from DSPstone [14]. Convolution was chosen because it is a real benchmark that has a short enough main loop to make scheduling by hand feasible.

We extended the GNU assembler to assemble statically pipelined instructions and implemented a simulator based on the SimpleScalar suite. In order to avoid having to compile the standard C library, we allow statically pipelined code to call functions compiled for MIPS. In order to make for a fair comparison, we set the number of iterations to 100,000. For both benchmarks, when compiled for the static pipeline, over 98% of the instructions executed are statically pipelined ones, with the remaining MIPS instructions coming from calls to printf. For the MIPS baseline, the programs were compiled with the VPO MIPS port with full optimizations enabled.

Table I gives the results of our experiments. We report the number of instructions committed, register file reads and writes, and "internal" reads and writes. For the MIPS programs, these internal accesses are the number of accesses to the pipeline registers. Because there are four such registers, and they are read and written every cycle, this figure is simply the number of cycles multiplied by four. For the static pipeline, the internal accesses refer to the internal registers. As can be seen, the statically pipelined versions of these programs executed significantly fewer instructions. This is done by applying
TABLE I
RESULTS OF THE EXPERIMENTAL EVALUATION

Benchmark     Architecture   Instructions   Register Reads   Register Writes   Internal Reads   Internal Writes
Vector Add    MIPS               507512         1216884           303047           2034536          2034536
              Static             307584          116808           103028           1000073           500069
              Reduction           39.4%           90.4%            66.0%             50.8%            75.4%
Convolution   MIPS              1309656         2621928           804529           5244432          5244432
              Static             708824          418880           403634           2200416          1500335
              Reduction           45.9%           84.0%            49.8%             58.0%            71.4%

traditional compiler optimizations at a lower level and by carefully scheduling the loop as discussed in Section 3. The static pipeline accessed the register file significantly less, because it is able to retain values in internal registers with the help of the compiler. Also, the internal registers are accessed significantly less than the larger pipeline registers.

While accurate energy consumption values have yet to be assessed, it should be clear that the energy reduction in these benchmarks would be significant. While results for larger benchmarks may not be as dramatic as these, this experiment shows that static pipelining, with appropriate compiler optimizations, has the potential to be a viable technique for significantly reducing processor energy consumption.

V. RELATED WORK

Statically pipelined instructions are most similar to horizontal micro-instructions [13]. Statically pipelined instructions, however, specify how to pipeline instructions across multiple cycles and are fully exposed to the compiler.

Static pipelining also bears some resemblance to VLIW [7] in that the compiler determines which operations are independent. However, VLIW instructions represent multiple RISC operations to be performed in parallel, while static pipelining encodes individual instruction effects that can be issued in parallel, where each effect corresponds to an action taken by a single pipeline stage of a traditional instruction.

Other architectures that expose more details of the datapath to the compiler are the Transport-Triggered Architecture (TTA) [5], the No Instruction Set Computer (NISC) [10] and the FlexCore [12]. These architectures rely on multiple functional units and register files to improve performance at the expense of an increase in code size. In contrast, static pipelining focuses on improving energy consumption without adversely affecting performance or code size.

Another related work is the Energy Exposed Instruction Set [1], which adds some energy-efficient features to a traditional architecture, such as accumulator registers and tagless memory operations when the compiler can guarantee a cache hit.

VI. FUTURE WORK

One important piece of future work is to improve the optimizing compiler, including the scheduling and software pipelining. In addition, we will develop and evaluate other compiler optimizations for this machine, including loop-invariant code motion. Another area of future work will be encoding the instructions more efficiently. Lastly, a model for estimating the energy consumption will be developed.

VII. CONCLUSION

In this paper, we have introduced the technique of static pipelining to improve processor efficiency. By statically specifying how instructions are broken into stages, we have simpler hardware and allow the compiler more control in producing efficient code. Statically pipelined processors provide the performance benefits of pipelining without the energy inefficiencies of dynamic pipelining. We have shown how efficient code can be generated for simple benchmarks for a statically pipelined processor to target both performance and power. Preliminary experiments show that static pipelining can significantly reduce energy consumption by reducing the number of register file accesses, while also improving performance. With the continuing expansion of high-performance mobile devices, static pipelining can be a viable technique for satisfying next-generation performance and power requirements.

ACKNOWLEDGEMENTS

We thank the anonymous reviewers for their constructive comments and suggestions. This research was supported in part by NSF grants CNS-0964413 and CNS-0915926.

REFERENCES

[1] K. Asanovic, M. Hampton, R. Krashinsky, and E. Witchel, "Energy-Exposed Instruction Sets," Power Aware Computing.
[2] T. Austin, E. Larson, and D. Ernst, "SimpleScalar: An Infrastructure for Computer System Modeling," Computer, vol. 35, no. 2, pp. 59–67, 2002.
[3] M. Benitez and J. Davidson, "A Portable Global Optimizer and Linker," ACM SIGPLAN Notices, vol. 23, no. 7, pp. 329–338, 1988.
[4] D. Cho, R. Ayyagari, G. Uh, and Y. Paek, "Preprocessing Strategy for Effective Modulo Scheduling on Multi-Issue Digital Signal Processors," in Proceedings of the 16th International Conference on Compiler Construction, Braga, Portugal, 2007.
[5] H. Corporaal and M. Arnold, "Using Transport Triggered Architectures for Embedded Processor Design," Integrated Computer-Aided Engineering, vol. 5, no. 1, pp. 19–38, 1998.
[6] I. Finlayson, G. Uh, D. Whalley, and G. Tyson, "Improving Low Power Processor Efficiency with Static Pipelining," in 15th Workshop on Interaction between Compilers and Computer Architectures.
[7] J. Fisher, "The VLIW Machine: A Multiprocessor for Compiling Scientific Code," Computer, vol. 17, no. 7, pp. 45–53, 1984.
[8] M. Guthaus, J. Ringenberg, D. Ernst, T. Austin, T. Mudge, and R. Brown, "MiBench: A Free, Commercially Representative Embedded Benchmark Suite," in Proceedings of the Fourth IEEE International Workshop on Workload Characterization (WWC-4), 2001, pp. 3–14.
[9] A. Kalambur and M. Irwin, "An Extended Addressing Mode for Low Power," in Proceedings of the 1997 International Symposium on Low Power Electronics and Design. ACM, 1997, pp. 208–213.
[10] M. Reshadi, B. Gorjiara, and D. Gajski, "Utilizing Horizontal and Vertical Parallelism with a No-Instruction-Set Compiler for Custom Datapaths," in Proceedings of the 2005 International Conference on Computer Design (ICCD '05). IEEE Computer Society, 2005, pp. 69–76.
[11] J. Scott, L. Lee, J. Arends, and B. Moyer, "Designing the Low-Power M-CORE Architecture," in Power Driven Microarchitecture Workshop, 1998, pp. 145–150.
[12] M. Thuresson, M. Sjalander, M. Bjork, L. Svensson, P. Larsson-Edefors, and P. Stenstrom, "FlexCore: Utilizing Exposed Datapath Control for Efficient Computing," Journal of Signal Processing Systems, vol. 57, no. 1, pp. 5–19, 2009.
[13] M. Wilkes and J. Stringer, "Micro-Programming and the Design of the Control Circuits in an Electronic Digital Computer," in Mathematical Proceedings of the Cambridge Philosophical Society, vol. 49, no. 2, 1953, pp. 230–238.
[14] V. Zivojnovic, J. Martinez Velarde, C. Schläger, and H. Meyr, "DSPstone: A DSP-Oriented Benchmarking Methodology," in Proceedings of the Fifth International Conference on Signal Processing Applications and Technology, 1994.
