
802.11a Transmitter: A Case Study in Microarchitectural Exploration

Nirav Dave, Michael Pellauer, Steve Gerding, & Arvind


Computer Science and Artificial Intelligence Lab
Massachusetts Institute of Technology
Cambridge, Massachusetts 02139
Email: {ndave, pellauer, sgerding, arvind}@mit.edu

Abstract

Hand-held devices have rigid constraints regarding power dissipation and energy consumption. Whether a new functionality can be supported often depends upon its power requirements. Concerns about the area (or cost) are generally addressed after a design can meet the performance and power requirements. Different micro-architectures have very different area, timing and power characteristics, and these need RTL-level models to be evaluated. In this paper we discuss the microarchitectural exploration of an 802.11a transmitter via synthesizable and highly-parametrized descriptions written in Bluespec SystemVerilog (BSV). We also briefly discuss why such architectural exploration would be practically infeasible without appropriate linguistic facilities.

No knowledge of 802.11a or BSV is needed to read this paper.

1. Introduction

802.11a is an IEEE standard for wireless communication [4]. The protocol translates raw bits from the MAC into Orthogonal Frequency-Division Multiplexing (OFDM) symbols, or sets of 64 32-bit fixed-width complex numbers. The protocol is designed to operate at different data rates; at higher rates it consumes more input to produce each symbol. Regardless of the rate, all implementations must be able to generate an OFDM symbol every 4 microseconds.

Because wireless protocols generally operate in portable devices, we would like our designs to be as energy-efficient as possible. This energy requirement makes software-based implementations of the transmitter unreasonable; a good implementation on a current-generation programmable device would need thousands of instructions to generate a single symbol, requiring the processor to run in the hundreds-of-MHz range to meet the performance rate of 250K symbols/sec. A dedicated hardware block, on the other hand, may be able to meet the performance rate even while operating in the sub-MHz range and consequently consume two orders of magnitude less power.

In general, the power consumption of a design can be lowered by parallelizing the design and running it at a lower frequency. This requires duplicating hardware resources. A minimum-power design can be achieved by lowering the clock frequency and voltage to just meet the performance requirement. On the other hand, we can save area by folding the hardware — reusing the same hardware over multiple clock cycles — and running at a higher frequency. Though we are primarily concerned with keeping the power of the entire system to a minimum, it is not obvious a priori which particular transmitter microarchitecture is best: the large, lowest-power one, or a smaller, higher-power design which frees up critical area resources for other parts of the design. The entire gamut of designs from highly parallel to very serial must be considered.

In this paper we describe the implementation of several highly-parametrized designs of the 802.11a transmitter in Bluespec SystemVerilog (BSV). We present performance, area, and power results for all of these designs. We discuss the benefits of a language with a strong enough type system and static elaboration capability to express and implement non-trivial parametrization. We conclude by arguing that without these kinds of linguistic facilities, such architectural exploration becomes significantly more difficult, to the point where it may not take place at all. In fact, the main contribution of this paper is to show, by a real example, that such exploration is possible early in the design process and yields deep insight into the area-power tradeoff.

No knowledge of 802.11a is needed to understand the architectural explorations discussed in this paper. We explain BSV syntax as we use it. However, some familiarity with Verilog syntax is assumed.

Organization: We begin with an overview of the 802.11a transmitter and show why it is important to focus on the IFFT block (Section 2). In Section 3 we present a combinational circuit implementation of the IFFT and use it as a reference implementation in the rest of the paper. We also show the power of BSV functions and parametrization in the design of combinational circuits. In Sections 4, 5, and 6 we discuss general microarchitectural explorations and how they are applied to our transmitter pipeline. In Section 7 we discuss the performance, area, and some power characteristics of each design. In Section 8 we discuss related work. Finally we conclude by discussing the role of HDLs in design exploration and reusable IP packaging.

2. 802.11a Transmitter Design

The 802.11a transmitter design can be decomposed into separate well-defined blocks, as shown in Figure 1.

[Figure 1. 802.11a Transmitter Design — block diagram: Controller -> Scrambler -> Encoder -> Interleaver -> Mapper -> IFFT -> Cyclic Extend]

Controller: The Controller receives packets from the MAC layer as a stream of data. The Controller is responsible for creating header packets for each data packet to be sent, as well as for making sure that each part of the data stream which comprises a single packet has the correct control annotations (e.g., the encoding rate).

Scrambler: The Scrambler XORs each data packet with a pseudo-random pattern of bits. This pattern is concisely described at 1 bit per cycle using a 7-bit shift register and 2 XOR gates. A natural extension of this design would be to unroll the loops to operate on multiple bits per cycle. The initial value of the shift register is reset for each packet.

Convolutional Encoder: The Convolutional Encoder generates 2 bits of output for every input bit it receives. Similar to the Scrambler, the design can be described concisely as 1 bit per cycle with a shift register and a few XOR gates. Again, unrolling the loop is an obvious and natural parametrization.

Puncturer: Our design only implements the lowest 3 data rates ({6, 12, 24} Mb/s) of the 802.11a specification. At these rates the Puncturer does no operation on the data, so we will omit it from our design and discussion.

Interleaver: The Interleaver operates on the OFDM symbol, in block sizes of 48, 96, or 192 bits depending on which rate is being used. It reorders the bits in a single packet. Assuming each block only operates on 1 packet at a time, this means that at the fastest rate we can expect to output only once every 4 cycles.

Mapper: The Mapper also operates on an OFDM symbol level. It takes the interleaved data and translates it directly into the 64 complex numbers representing different frequency "tones."

IFFT: The IFFT performs a 64-point inverse Fast Fourier Transform on the complex frequencies to translate them into the time domain, where they can then be transmitted wirelessly. Our initial implementation (discussed in greater detail in Section 3) was a combinational design based on a 4-point butterfly.

Cyclic Extender: The Cyclic Extender extends the IFFT-ed symbol by appending the beginning and end of the message to the full message body.

2.1. Preliminary Design Synthesis

When we began this project we had a good description of the 802.11a transmitter algorithm available to us. It took approximately three man-days to understand and code up the algorithm in BSV (the authors are BSV experts). This time included coding a library of arithmetic operations for complex numbers. This library is approximately 265 lines of BSV code.

The RTL for our initial design was generated using the Bluespec Compiler (version 3.8.67), and synthesized with Synopsys Design Compiler (version X-2005.09) with TSMC 0.18µm standard cell libraries. In this design, the steady-state throughput at the highest supported rate (24 Mb/s) was 1 symbol every four clock cycles. Therefore, we needed a clock frequency of 1 MHz to meet the 4-microsecond requirement. The clock frequency for synthesis was set to 2 MHz to provide sufficient slack for place-and-route. With this setting the initial implementation had an area of 4.69 mm², or roughly 500K 2X1-NAND gate equivalents. The breakdown of the lines of code and relative areas for each block is given in Figure 2.

We can see that the number of lines of source code for a block has no correlation with the size of the area the block occupies. The Convolutional Encoder requires the most code to describe, but takes effectively no area. The IFFT, on the other hand, is only slightly shorter, yet represents a substantially larger fraction of the total area. Additionally, the critical path of the IFFT is many times larger than the critical path of any other block in the design.

Given these preliminary statistics, we focused our efforts on the design of the IFFT block. Ultimately we will present seven different designs for the transmitter, created by plugging in seven different variations of the IFFT block.
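The Scrambler described above (a 7-bit shift register plus 2 XOR gates) is small enough to cross-check in software. Below is a minimal Python model; the generator polynomial x^7 + x^4 + 1 comes from the 802.11 standard rather than from this paper, and the bit ordering is one plausible convention:

```python
def scramble(bits, seed=0b1111111):
    """XOR a bit stream with the keystream of a 7-bit LFSR.
    One XOR forms the feedback bit and a second XOR applies it to
    the data: the '7-bit shift register and 2 XOR gates' above."""
    state = seed
    out = []
    for b in bits:
        fb = ((state >> 6) ^ (state >> 3)) & 1  # feedback from two taps
        out.append(b ^ fb)
        state = ((state << 1) | fb) & 0x7F      # shift the register
    return out

# A maximal-length 7-bit LFSR repeats its keystream every 2^7 - 1 = 127
# bits, and each period contains exactly 64 ones.
keystream = scramble([0] * 254)
```

Because the keystream depends only on the seed, scrambling twice with the same seed returns the original data, which is how the receive side descrambles.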

Design Block     Lines of Code   Relative Area
Controller             49              0%
Scrambler              40              0%
Conv. Encoder         113              0%
Interleaver            76              1%
Mapper                112             11%
IFFT                   95             85%
Cyc. Extender          23              3%

Figure 2. Initial Design Results

3. Baseline: Combinational IFFT

In this section we describe the combinational IFFT block from our initial transmitter design. This implementation serves both as a reference implementation for verification and as a baseline to compare with our alternative microarchitectures.

At a high level, the n-point IFFT can be partitioned into log_k(n) stages of n/k k-point "butterfly" submodules (bflyk for short). At the end of each stage of butterflies, the output values are permuted before being passed to the next stage. It would be straightforward to create an IFFT description parametrized by n, but such parametrization is not needed in this design because the 802.11a specification requires only a 64-point IFFT.

Even limited to a 64-point IFFT, we are free to choose the size of the butterfly submodule. An IFFT using bfly4s uses fewer arithmetic operators than one using bfly2s. By similar reasoning, a bfly8-based design uses fewer operators than bfly4. However, larger butterflies constrain where the computation can be partitioned, and can limit further microarchitectural choices. To simplify the discussion we only consider bfly4 sub-blocks in this work. Figure 3 shows the circuit structure of our combinational IFFT.

[Figure 3. Combinational IFFT Module]

3.1. The bfly4 Function

There are many similarities between combinational hardware logic and functions in a software language. Both have well-defined inputs and outputs, and are composed of simple sub-functions (in hardware, gates; in software, assembly instructions). Intermediate values have a hardware analog in wires.

In Bluespec a function definition corresponds exactly to a combinational logic definition. A function call is evaluated by the compiler, and the resulting combinational logic inlined. There is no notion of a run-time stack in BSV: even recursive function calls are totally unfolded at compile time.

[Figure 4. The bfly4 Circuit: twiddle values t[n] are statically-known parameters]

The bfly4 circuit is shown in Figure 4. It takes as input 4 complex numbers to be transformed, and three "twiddle" factors which represent an initial rotation. The output of the circuit is four new complex numbers. This circuit can be described in Bluespec as the following function:

  function Vector#(4,Complex#(n))
           bfly4(Vector#(3,Complex#(n)) twids,
                 Vector#(4,Complex#(n)) xs);

    Vector#(4, Complex#(n)) retv = newVector(),
                            rots = newVector(),
                            temp = newVector();

    rots[0] = xs[0];
    rots[1] = twids[0] * xs[1];
    rots[2] = twids[1] * xs[2];
    rots[3] = twids[2] * xs[3];

    temp[0] = rots[0] + rots[2];
    temp[1] = rots[0] - rots[2];
    temp[2] = rots[1] + rots[3];
    temp[3] = rots[1] - rots[3];

    // rotate temp[3] by 90 degrees
    temp[3] = mult_by_i(temp[3]);

    retv[0] = temp[0] + temp[2];
    retv[1] = temp[1] - temp[3];
    retv[2] = temp[0] - temp[2];
    retv[3] = temp[1] + temp[3];

    return retv;
  endfunction
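The dataflow in bfly4 is a radix-4 inverse-DFT butterfly, so it can be cross-checked numerically. The Python sketch below mirrors the function above using floating-point complex numbers; the output ordering [0, 3, 2, 1] relative to the inverse-DFT bins is our reading of the dataflow, not something the paper states:

```python
import numpy as np

def bfly4(twids, xs):
    # Twiddle rotation: input 0 passes through, inputs 1..3 are rotated.
    rots = [xs[0], twids[0] * xs[1], twids[1] * xs[2], twids[2] * xs[3]]
    temp = [rots[0] + rots[2], rots[0] - rots[2],
            rots[1] + rots[3], rots[1] - rots[3]]
    temp[3] = 1j * temp[3]  # rotate temp[3] by 90 degrees
    return [temp[0] + temp[2], temp[1] - temp[3],
            temp[0] - temp[2], temp[1] + temp[3]]

rng = np.random.default_rng(0)
xs = rng.normal(size=4) + 1j * rng.normal(size=4)
tw = np.exp(2j * np.pi * rng.random(size=3))

out = bfly4(tw, xs)
# Reference: unnormalized 4-point inverse DFT of the twiddled inputs.
rots = [xs[0], tw[0] * xs[1], tw[1] * xs[2], tw[2] * xs[3]]
ref = np.fft.ifft(rots) * 4
```

The butterfly outputs match the inverse-DFT bins up to a fixed reordering, which the permutation network between stages accounts for.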
Note that the Complex type has been written in such a way that it represents complex numbers of any bit-precision n. Type parameters are indicated by the # sign. For example, Vector#(4, Complex#(n)) is a vector of 4 n-bit precision complex numbers. The newVector function creates an uninitialized vector. The Complex type is a structure consisting of real and imaginary parts (i and q respectively):

  typedef struct {
    SaturatingBit#(n) i;
    SaturatingBit#(n) q;
  } Complex#(type n);

All arithmetic on complex numbers is defined in terms of saturating fixed-point arithmetic. For lack of space we omit the description of these complex operators — all of them have been implemented in BSV using ordinary integer arithmetic and bitwise operations. The parametrization of the bfly4 is realized partially through the overloading of arithmetic operators; the compiler is able to select statically the correct operator based on its type.

A polymorphic BSV function such as the above bfly4 description can be thought of as a combinational logic generator. The Bluespec compiler instantiates the bfly4 function for a specific bit-width during a compiler phase known as static elaboration. During this phase the compiler:

• Instantiates functions and submodules with specific parameter values

• Unrolls loops and recursive function calls

• Performs aggressive constant propagation, including the propagation of don't-care values

We can leverage static elaboration to create a more concise, more generalized description [1]. For example, a vector in the above function is simply a convenient way of grouping wires. The vector addresses are statically known and do not survive the elaboration phase. In our design we often use vectors of variables, and vectors of submodules such as registers.

In hardware compilation, constant propagation sometimes achieves results which are surprising from a software point of view. For example, for each bfly4 in our combinational design the twiddle factors are statically known. Each twiddle factor is effectively a random number, which rules out any word-level simplification (e.g., multiplication by 1 or 0). But the bit-level representation of each constant allows the gate-level constant propagation to dramatically simplify the area-intensive multiply circuit. A general bfly4 circuit which takes all values as inputs is 2.5 times larger than a specialized circuit with statically known twiddle factors (208µm² vs. 83µm²)!

Later, we will explore other variants of the IFFT where the multipliers cannot be statically optimized, but sharing of hardware occurs at a higher level.

3.2. IFFT Function

A straightforward way to represent the IFFT of Figure 3 is to write each bfly4 block explicitly. We use vectors to represent the intermediate values (wires) between the different stages of bfly4 blocks. We also need to define the specific twiddle values used for each bfly4 block. Similarly we will define a vector to represent the permutations between stages. Expressed as a function we get:

  function ifftA(Vector#(64,Complex#(16)) x);
    prebfly0 = x;
    twid_0_0 = ... ;
    twid_0_1 = ... ;
    // Stage 1
    postbfly0[3:0] = bfly4(twid_0_0, prebfly0[3:0]);
    ...
    postbfly0[63:60] = bfly4(twid_0_15, prebfly0[63:60]);
    //Permute 1
    prebfly1[0] = postbfly0[0];
    prebfly1[1] = postbfly0[4];
    ...
    // Stage 2
    postbfly1[3:0] = bfly4(twid_1_0, prebfly1[3:0]);
    ...
    postbfly1[63:60] = bfly4(twid_1_15, prebfly1[63:60]);
    //Permute 2
    prebfly2[0] = postbfly1[0];
    prebfly2[1] = postbfly1[4];
    ...
    // Stage 3
    postbfly2[3:0] = bfly4(twid_2_0, prebfly2[3:0]);
    postbfly2[7:4] = bfly4(twid_2_1, prebfly2[7:4]);
    ...
    final[0] = postbfly2[0];
    final[1] = postbfly2[4];
    ...
    return(final[63:0]);
  endfunction

In fact this is how many Verilog programmers would write this code, probably using their favorite generation scripts.

Improved Representations in BSV: Looking at this description we can see a lot of replication. Each bfly4 in the stage has a very regular pattern in its input parameters.
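ifftA instantiates every bfly4 by hand. Because vector indices are statically known and vanish at elaboration, a loop that instantiates the same blocks describes the identical circuit. That claim is easy to test in software; the Python sketch below uses a made-up stand-in butterfly and made-up constants (nothing here is the paper's real twiddle or permutation data):

```python
# Stand-in "butterfly": any function of one constant and four values.
def bfly(t, xs):
    s = sum(xs)
    return [s + t * x for x in xs]

TWIDS = [3, 5]                   # made-up per-block constants
PERM = [2, 0, 3, 1, 6, 4, 7, 5]  # made-up inter-stage wiring

def stage_explicit(x):
    # ifftA style: every block and every wire written out by hand.
    post = bfly(TWIDS[0], [x[0], x[1], x[2], x[3]]) \
         + bfly(TWIDS[1], [x[4], x[5], x[6], x[7]])
    out = [0] * 8
    out[0] = post[PERM[0]]; out[1] = post[PERM[1]]
    out[2] = post[PERM[2]]; out[3] = post[PERM[3]]
    out[4] = post[PERM[4]]; out[5] = post[PERM[5]]
    out[6] = post[PERM[6]]; out[7] = post[PERM[7]]
    return out

def stage_looped(x):
    # Loop style: the same blocks and wires generated by loops.
    post = []
    for b in range(2):
        post += bfly(TWIDS[b], x[4 * b:4 * b + 4])
    return [post[PERM[i]] for i in range(8)]

x = list(range(1, 9))
```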
If we organize all of the twiddles and permutations as vectors, then we can rewrite each stage using loops:

  function ifftB(Vector#(64,Complex#(16)) x);

    //compute following constants at compile time
    twid0[0] = ... ; ... twid0[47] = ... ;
    twid1[0] = ... ; ... twid1[47] = ... ;
    twid2[0] = ... ; ... twid2[47] = ... ;

    permute[2:0][63:0] = ... ;

    prebfly0 = x;

    //Stage 1
    for(Integer i = 0; i < 16; i = i + 1)
      postbfly0[4*i+3 : 4*i] =
        bfly4( twid0[3*i+2 : 3*i],
               prebfly0[4*i+3 : 4*i]);
    for(Integer i = 0; i < 64; i = i + 1)
      prebfly1[i] = postbfly0[permute[0][i]];

    //Stage 2
    for(Integer i = 0; i < 16; i = i + 1)
      postbfly1[4*i+3 : 4*i] =
        bfly4( twid1[3*i+2 : 3*i],
               prebfly1[4*i+3 : 4*i]);
    for(Integer i = 0; i < 64; i = i + 1)
      prebfly2[i] = postbfly1[permute[1][i]];

    //Stage 3
    for(Integer i = 0; i < 16; i = i + 1)
      postbfly2[4*i+3 : 4*i] =
        bfly4( twid2[3*i+2 : 3*i],
               prebfly2[4*i+3 : 4*i]);
    for(Integer i = 0; i < 64; i = i + 1)
      final[i] = postbfly2[permute[2][i]];

    return(final[63:0]);
  endfunction

This new organization makes no change to the represented hardware. After the compiler unrolls the for-loops and does constant propagation, the result is the exact same gate structure as ifftA.

Here we can see another level of regularity. This time it lies across all of the stages. We can rewrite the function as:

  function ifftC(Vector#(64,Complex#(16)) x);

    //compute following constants at compile time
    twid[2:0][47:0] = ... ;
    permute[2:0][63:0] = ... ;

    for(Integer stage = 0; stage < 3; stage = stage + 1)
    begin
      if (stage == 0)
        prebfly[stage][63:0] = x;
      else
        prebfly[stage][63:0] = out[stage-1][63:0];

      for(Integer i = 0; i < 16; i = i + 1)
        postbfly[stage][4*i+3 : 4*i] =
          bfly4( twid[stage][3*i+2 : 3*i],
                 prebfly[stage][4*i+3 : 4*i]);

      for(Integer i = 0; i < 64; i = i + 1)
        out[stage][i] = postbfly[stage][permute[stage][i]];
    end

    return(out[2][63:0]);
  endfunction

Now we have a concise description of the hardware, exactly what an experienced BSV designer would have written in the first place. We have not shown the code for generating the twiddles and the permutation. Fortunately, like the bfly4 organization, these constants have a mathematical definition which can be represented as a function using sines, cosines, modulo, multiply, etc. A good question to ask is if such a description still represents good combinational logic. Again, just like the bfly4 parametrization, the inputs to each call to these functions are statically known. Consequently the compiler can aggressively optimize away the combinational logic to produce circuits exactly the same as ifftA.

As noted in the last section, this IFFT design occupies roughly 85% of the total area and has a critical path several times larger than the critical path of any other block. Next we explore how to reuse parts of this hardware to reduce the area. All such designs involve introducing registers to hold intermediate values. As a stepping stone to the designs that reuse hardware, we first describe a simple pipelining of the combinational IFFT in the next section. This will also have the effect of reducing the critical path of the IFFT design by a factor of three.

4. The Pipelined IFFT

At a high level, pipelining is simply partitioning a task into a sequence of smaller sub-tasks which can be done in parallel. We can start processing the next chunk of data before we finish processing the previous chunk. Generally, the stages of a pipeline operate in lockstep (see Figure 5).

[Figure 5. Synchronous Pipeline]
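The lockstep behavior sketched in Figure 5 can be prototyped in a few lines of Python. This is a cycle-accurate sketch with arbitrary placeholder stage functions; None marks a register that does not yet hold valid data:

```python
def simulate_pipeline(stream, f0, f1, f2, cycles):
    """Cycle-accurate model of a 3-stage lockstep pipeline.
    None marks a register that does not yet hold valid data."""
    sreg1 = sreg2 = None
    out = []
    inputs = iter(stream)
    for _ in range(cycles):
        # Reads happen before writes, as with non-blocking assignment:
        # compute all next-state values from the current state.
        next2 = None if sreg1 is None else f1(sreg1)
        if sreg2 is not None:
            out.append(f2(sreg2))
        x = next(inputs, None)
        sreg1 = None if x is None else f0(x)
        sreg2 = next2
    return out

res = simulate_pipeline(range(4),
                        lambda x: x + 1,   # stage 0
                        lambda x: 2 * x,   # stage 1
                        lambda x: x - 3,   # stage 2
                        cycles=7)
# One result per cycle once the pipeline is full, in input order.
```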
In BSV, we can represent such a 3-stage pipeline using the following Guarded Atomic Action, or rule:

  rule sync-pipeline (True);
    let sx0 = inQ.first();
    inQ.deq();
    sReg1 <= f0(sx0);
    let sx1 = sReg1;
    sReg2 <= f1(sx1);
    let sx2 = sReg2;
    outQ.enq(f2(sx2));
  endrule

A rule consists of a set of actions that alter the state, and a predicate (guard) which signifies when it is valid for these state changes to occur. The state altered by the above rule consists of two FIFOs (inQ, outQ) and two registers (sReg1, sReg2); the actions are to set the value of registers (e.g., sReg2 <= f1(sx1)), or to enqueue or dequeue from a FIFO. The actions are combined into one atomic action and are inherently parallel. This means that all the reads happen before all writes, as in non-blocking assignments in Verilog.

This rule shows that all stages of the pipeline read a value from the previous stage and store the value for the next stage. According to BSV semantics, this rule won't fire if either inQ is empty or outQ is full. These conditions are implicit and incorporated into the rule predicate by the compiler. For details of Bluespec synthesis and scheduling see [3] and [6].

Though straightforward to write, this rule has some problems. First, some registers may not hold valid values, for instance, when the pipeline has just started to fill. We must be careful not to enqueue junk values into the output queue. Second, since we do not move data when we cannot take a value from inQ, the last few values will be left in the pipeline. BSV provides elegant solutions to both these problems. We can prevent junk values from entering by providing a valid bit for each value and explicitly predicating the enqueue operation on outQ for valid data only. In BSV this is done using the Maybe datatype, which is a tagged union:

  typedef union tagged {
    void Invalid;
    data_T Valid;
  } Maybe#(type data_T);

This is equivalent to adding a valid bit to any datatype. We declare the registers to hold a Maybe type. Then we replace the inQ with a FIFO which returns Invalid when empty, and outQ with a FIFO which accepts Maybe values (by simply ignoring enqueued Invalid). Once this is done, we can easily prevent values from becoming stuck by checking to see if the input queue is empty and using the Invalid value in place of attempting to remove an input.

Note that the rule was already parametrized by functions f0, f1 and f2. This level of parametrization is sufficient for the pipelined IFFT; we only need to replace these function calls representing the 3 bfly stages. In fact we can generalize further by writing a single function stage_f, given below. Most of the code for stage_f is taken from the outer loop body of ifftC. The first parameter selects one of the 3 possible stage operations.

  function stage_f(Bit#(2) stage,
                   Vector#(64, Complex#(n)) prebfly);

    Vector#(64, Complex#(n)) out = newVector();

    for(Integer i = 0; i < 16; i = i + 1)
      postbfly[stage][4*i+3 : 4*i] =
        bfly4( twids[stage][3*i+2 : 3*i],
               prebfly[4*i+3 : 4*i]);

    for(Integer i = 0; i < 64; i = i + 1)
      out[i] = postbfly[stage][permute[stage][i]];
    return(out[63:0]);
  endfunction

Now f0 (and similarly f1 and f2) can be defined as follows:

  function Vector#(64, Complex#(n)) f0(
             Vector#(64, Complex#(n)) x);
    return(stage_f(0,x));
  endfunction

We note in passing that we can also write the synchronous pipeline rule so that it is parametrized by the number of stages. Assume that we have a combinational function f, similar to our stage_f function, which takes two parameters: the current stage i and the input value x, and returns the result of fi on x. The following now models an n-stage synchronous pipeline:

  Vector#(TSub#(n,1), Reg#(Maybe#(data_T)))
    sRegs = newVector();
  //Instantiate n-1 registers
  for (Integer i = 0; i < n - 1; i = i + 1)
    sRegs[i] <- mkReg(Invalid);

  rule sync-pipeline (True);
    Maybe#(data_T) sx;
    for (Integer i = 0; i < n; i = i + 1)
    begin
      //Get stage input
      if (i != 0)
        sx = sRegs[i-1];
      else
        if (inQ.notEmpty)
        begin
          sx = inQ.first();
          inQ.deq();
        end
        else
          sx = Invalid;

      //Calculate value
      Maybe#(data_T) ox;
      case(sx) matches
        tagged Valid .x:
          ox = f(fromInteger(i), x);
        tagged Invalid:
          ox = Invalid;
      endcase

      //Write Outputs
      if (i == n-1)
        outQ.enq(ox);
      else
        sRegs[i] <= ox;
    end
  endrule

It is important to keep in mind that because the stage parameter is known at compile time, the compiler can optimize each call of f to be specific to each stage.

5. Folded or Circularly-Pipelined IFFT

Our synchronous IFFT pipeline can produce a result every clock cycle. This is overkill, as we can't produce that rate of input from the transmitter. If we can meet the specifications by producing a data element every three cycles, then it may be possible to fold all three stages from the pipeline in Figure 5 into one, saving area. An example of such a pipeline structure is the folded pipeline shown in Figure 6, which assumes that all stages do identical computation represented by function f. In this structure a data element enters the pipeline, goes around three times, and then is ejected.

[Figure 6. Folded or Circular Pipeline Design]

In a folded pipeline, since the same hardware is used for conceptually different stages, we often need some extra state elements and muxes to choose the appropriate combinational logic. For example, it is common to have a stage counter and associated control logic to remember where the data is in the pipeline. The code for an n-way folded pipeline such as shown in Figure 6 may be written as follows:

  rule folded-pipeline (True);
    if (stage==0) inQ.deq();
    sxIn = (stage==0) ? inQ.first() : sReg;
    sxOut = f(stage, sxIn);
    if (stage==n)
      outQ.enq(sxOut);
    else
      sReg <= sxOut;
    stage <= (stage == n) ? 0 : stage + 1;
  endrule

The stage function for a folded pipeline with three stages may be written as follows:

  function f(stage, sx);
    case(stage)
      0: return f0(sx);
      1: return f1(sx);
      2: return f2(sx);
    endcase
  endfunction

[Figure 7. Function f]

Considering the pipelines in Figures 5 and 6 and this definition of function f (shown in Figure 7), it is difficult to see how any hardware would be saved. The folded pipeline uses one pipeline register instead of two, but it also introduces potentially two large muxes, one at the input and one at the output of function f. If f0, f1 and f2 represent large combinational circuits, then the real gain can come only by sharing the common parts of these circuits, because these functions will never operate together at the same time. As an example, suppose each fi is composed of two parts: a sharable part fs and an unsharable part fui, with fi(x) = fs(fui(x)). Given this information we could have written function f as follows:

  function f (stage, sx);
    let sxt = case (stage)
                0: return fu0(sx);
                1: return fu1(sx);
                2: return fu2(sx);
              endcase;
    return fs(sxt);
  endfunction

where fs(sx) represents the shared logic among the three stages (Figure 8).
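The factored form of f trades three full circuits behind a mux for one shared fs fed by a mux over the small fui parts. Functional equivalence of the two forms is easy to check in software; the Python sketch below uses arbitrary stand-ins for fs and fu0-fu2 (the functions themselves are made up):

```python
# Arbitrary stand-ins: fs is the large sharable part, fu* the
# small per-stage parts, so fi(x) = fs(fui(x)).
fs = lambda x: (x * x) % 97
fu = [lambda x: x + 1, lambda x: 3 * x, lambda x: x - 4]

def f_replicated(stage, x):
    """Original form: three complete circuits, selected by a mux."""
    f0 = lambda v: fs(fu[0](v))
    f1 = lambda v: fs(fu[1](v))
    f2 = lambda v: fs(fu[2](v))
    return [f0, f1, f2][stage](x)

def f_shared(stage, x):
    """Factored form: mux only the unshared part, then one shared fs."""
    return fs(fu[stage](x))
```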
6. Further Hardware Reuse: Super-Folded
Pipeline IFFT

We wanted to determine if we could further reduce area


by using less then 16 bfly4s (i.e. by folding the stage func-
tion in the folded pipeline). The stage function given eariler,
can be modified to support this. We can “chunk” the i loop
in the stage f function to make use of m (< 16) bfly4s
as follows:

Figure 8. Function f with explicit sharing for(Integer i1 = 0; i1 < 16; i1 = i1 + m)


for(Integer i2 = 0; i2 < m; i2 = i2 + 1)
begin
let i = i1 + i2;
common subexpression elimination automatically, but this postbfly[stage][4*i+3 : 4*i] =
form is guaranteed to generate the expected hardware. bfly4( twids[stage][3*i+2 : 3*i],
prebfly[stage][4*i+3 : 4*i]);
It turns out that the stage f function given earlier, ef-
fectively captures the sharing of the 16 bfly4 blocks in the
design. To our dismay, In spite of this sharing, we found This change by itself has no effect on the generated hard-
that the folded pipeline area (5.89mm2 ) turned out to be ware as both loops would be unrolled fully. What we would
larger than the area of the simple pipeline (5.14mm2 )! like to is to build a new superfolded stage function using the
inner i2 loop.
Sharing of the bfly4s comes at a cost. Since each bfly4
Consider the case of m = 2. What we would like is
in our design uses different twiddle constants in different
to pick up the data (64 complex numbers) from the input
stages it is no longer possible to take advantage of the con-
stant twiddle factors in our optimizations. Recall from our
earlier discussion in Section 3 that this results in an increase
in area of a factor of 2.5, close to the three-fold reduction
we expect from folding! In addition, area overhead is
introduced by the muxes required to pass different twiddles
into the bfly4s.
On further analysis, we discovered that a contributing factor
to the lack of sharing was the use of different permutations
for different stages, implying one more set of muxes. Once
we realized this, we made an algorithmic adjustment and
recalculated the constants so that the permutations were the
same for each stage in the design, removing these muxes.
As can be seen in Figure 9, the folded transmitter design
using the new algorithm took 75% of the area of the simple
pipeline design (3.97 vs. 5.25).
It is not clear how much of this was due to the improved
sharing, as we had to change the twiddle constants, which
may have changed the possible optimizations. A possible
explanation for the increased area of the combinational and
simply pipelined IFFTs is the lack of twiddle-related
optimizations.

    Design          Old Area (mm²)    New Area (mm²)
    Combinational        4.69              4.91
    Simple Pipe          5.14              5.25
    Folded Pipe          5.89              3.97

Figure 9. IFFT Area Results: New vs. Old Algorithm

queue, go around the m bfly4s 48/m (= 24) times, and then
enqueue the result (64 complex numbers) in the output queue.
In each iteration, we take the entire set of 64 complex
numbers, but manipulate only 4∗m (= 8) of them (using the
m bfly4s), leaving the rest unchanged. We can deal with
permutations by applying the permutation every 16/m (= 8)
cycles (i.e., when all new data has been calculated).
Taking this approach, the new stage function with m bfly4
blocks is:

  function Vector#(64,Complex#(n)) stage_f_m_bfly4
                (Bit#(6) stage, Vector#(64,Complex#(n)) s_in);
    Vector#(64,Complex#(n)) s_mid = s_in;
    Bit#(6) st16 = stage % 16;
    for (Bit#(6) i2 = st16; i2 < st16 + m; i2 = i2 + 1)
      s_mid[4*i2+3:4*i2] = bfly4(twid(stage[5:4], i2),
                                 s_in[4*i2+3:4*i2]);

    // permute
    Vector#(64,Complex#(n)) s_out = newVector();
    for (Integer i = 0; i < 64; i = i + 1)
      s_out[i] = s_mid[permute[i]];

    // return the permuted value when appropriate
    return ((st16 + m == 16) ? s_out[63:0] : s_mid[63:0]);
  endfunction
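To illustrate the folding schedule this stage function implements, it can be modeled in ordinary software. The sketch below (Python rather than BSV, purely illustrative) replaces the radix-4 butterfly and twiddle constants with a placeholder and uses a made-up stride permutation; only the schedule described in the text is modeled: 4∗m of the 64 values are touched per cycle, and the permutation is applied once every 16/m cycles.

```python
# Illustrative software model of one super-folded IFFT stage. The value of
# M, the placeholder bfly4, and the PERMUTE constants are all assumptions;
# only the folding schedule from the text above is modeled.

M = 2                                    # number of physical bfly4 blocks (example)
GROUPS = 16                              # 64 values / 4 values per bfly4
CYCLES_PER_LOGICAL_STAGE = GROUPS // M   # permute once every 16/M (= 8) cycles
PERMUTE = [(i % 4) * 16 + i // 4 for i in range(64)]  # hypothetical stride permutation

def bfly4(values4):
    # Placeholder for the radix-4 butterfly: tags its four inputs as processed.
    return [v + 1 for v in values4]

def stage_step(cycle, s_in):
    # One cycle: run the M butterflies over groups [base, base + M) and
    # leave the other 64 - 4*M values unchanged.
    base = (cycle % CYCLES_PER_LOGICAL_STAGE) * M
    s_mid = list(s_in)
    for g in range(base, base + M):
        s_mid[4 * g:4 * g + 4] = bfly4(s_in[4 * g:4 * g + 4])
    # Apply the permutation only when a full logical stage has been computed.
    if cycle % CYCLES_PER_LOGICAL_STAGE == CYCLES_PER_LOGICAL_STAGE - 1:
        return [s_mid[PERMUTE[i]] for i in range(64)]
    return s_mid

# One symbol = 3 logical stages = 3 * 16/M cycles, i.e. the 48/M (= 24)
# trips around the bfly4s described above.
data = [0] * 64
for cycle in range(3 * CYCLES_PER_LOGICAL_STAGE):
    data = stage_step(cycle, data)
assert data == [3] * 64   # every value passed through exactly 3 logical stages
```

Running the model for one symbol touches every value exactly three times, once per logical stage, which is the invariant the folded hardware schedule must preserve regardless of m.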
Our new stage function has the exact same form as our original
folded design, differing only in the number of iterations
required to complete the computation. Consequently, the
rule is also almost the same:

  rule super-folded-pipeline (True);
    sx = (stage == 0) ? inQ.first() : sReg;
    if (stage == 0)
      inQ.deq();
    Vector#(64,Complex#(n)) sout = stage_f_m_bfly4(stage, sx);
    // constant integer divide (optimized away)
    if (stage == (3*16 div m) - 1)
      begin
        outQ.enq(sout);
        stage <= 0;
      end
    else
      begin
        sReg <= sout;
        stage <= stage + 1;
      end
  endrule

7. Results

In our explorations we described many different versions
of the 802.11a transmitter, varying only the implementation
of the IFFT block. We used 7 specific IFFT designs: the
combinational version, a synchronous pipeline version, and 5
super-folded pipeline versions with 16, 8, 4, 2, and 1 bfly4
nodes, respectively.
We were able to perform this exploration very quickly.
Including the initial 3 man-days to describe the initial
design, generating the variants took only an additional 2
man-days. This includes changing the algorithm to use
constant permutations.
We took all 7 transmitter designs using our new constant-
permutation IFFT algorithm and ran them through gate-level
synthesis (as described in Section 2) to get area estimates.
Each of these synthesis runs took approximately 8 hours. We
then fed these gate-level descriptions into Sequence Design
PowerTheatre to get power estimates. All these results are
presented in Figure 10.
At their maximum frequencies, all the designs easily met
the performance criterion of producing a symbol every 4
microseconds. As expected, the pipelined design was largest,
followed by the combinational design, and then the folded
designs in descending order of the number of bfly4 blocks
used.
For power, we see that as we fold our pipeline we consume
more power. This makes intuitive sense: folding forces the
remaining bfly4 blocks to be general, and causes each
individual block to switch proportionally faster, which
means more power usage from the IFFT (plus the additional
power from a global increase in clock frequency across the
entire clock domain).
After our exploration we found that the 5 folded designs
and the combinational design were all Pareto optimal and
therefore candidates for the final 802.11a transmitter block
in our full system. While the pipelined IFFT design itself is
Pareto optimal among IFFT designs, the inability to gain
further power reduction from voltage scaling makes it worse
overall than the combinational design.
In our analysis it became clear that some parts of the
802.11a pipeline (e.g. the IFFT) were constantly generating
data at full throughput, while other parts cannot be
constantly busy. Thus, these idling blocks could operate at a
much lower frequency without affecting performance. This
suggests that a multiple-clock-domain [2] design should be
explored, as it could further improve the power results.
Lastly, since Place and Route takes a long time (tens of
additional hours), our area and timing numbers were
generated at the synthesized gate level. Methodologically
this is acceptable because one is primarily interested in the
relative merits of each design. To ensure that each design
could meet timing after place and route, we set the required
clock period for synthesis to half of that needed to meet
the performance requirement. We plan to run our designs
through Place and Route to generate more accurate area and
power numbers to validate our estimates.

8. Related Work

Efficient implementation of 802.11a is an active field of
research. Here we limit consideration to works we consider
representative of the broader approaches in the field.
Maharatna et al. [5] demonstrate that the IFFT can be
effectively implemented as an ASIC using a similar design
flow. While their specific microarchitecture achieves results
competitive with commercial systems, they do not present
the results of any architectural exploration. With a
generalized description we could explore whether a variant of
their system would achieve even better results.
Zhang and Brodersen [8] conducted a case study of
several possible implementations of both the IFFT of the
802.11a transmitter and the Viterbi block of a receiver.
Rather than exploring standard tool-flow ASICs, they used
Function-Specific Reconfigurable Hardware (FSRH):
standard-cell ASICs which retain some dynamic configuration
capabilities. Their work shows that the FSRH implementation
significantly outperforms DSP and FPGA implementations in
terms of energy efficiency and computation density. We
expect non-reconfigurable ASICs such as those presented here
to be at least as efficient.
An alternative strategy for implementing an efficient FFT
is to use high-latency RAM banks to store the complex
numbers and a small processing element to load addresses
from the RAM, process them, and write them back. Son
et al. [7] compared such an implementation to a pipelined
FFT as presented in Section 4. They conclude that although
  Transmitter Design    Area    Symbol Latency   Throughput        Min. Freq to        Avg. Power
  (IFFT Block)          (mm²)   (cycles)         (cycles/symbol)   Achieve Req. Rate   (mW)
  Combinational         4.91    10                4                 1.0 MHz             3.99
  Pipelined             5.25    12                4                 1.0 MHz             4.92
  Folded (16 Bfly4s)    3.97    12                4                 1.0 MHz             7.27
  Folded (8 Bfly4s)     3.69    15                6                 1.5 MHz            10.9
  Folded (4 Bfly4s)     2.45    21               12                 3.0 MHz            14.4
  Folded (2 Bfly4s)     1.84    33               24                 6.0 MHz            21.1
  Folded (1 Bfly4)      1.52    57               48                12.0 MHz            34.6

Figure 10. Performance of 802.11a Transmitters for Various Implementations of the IFFT Block

a RAM-based implementation uses less area, it requires a
significantly higher clock speed, and thus is less power-
efficient overall.

9. Conclusions

In this paper we explored various microarchitectures of
an IFFT block, the critical resource-intensive module in an
802.11a transmitter. We demonstrated how languages with
powerful static elaboration capabilities can result in
hardware descriptions which are both more concise and more
general. We used such generalized descriptions to explore
the physical properties of a wide variety of
microarchitectures early in the design process. We argue
that such high-level language capabilities are essential if
future architectural decisions are to be based on empirical
evidence rather than designer intuition.
All the folded designs were generated from the same
source description, varying only the input parameters. Even
the other versions share a lot of common structure, such as
the bfly4 definition and the representation of complex
numbers and operators. This has a big implication for
verification: instead of verifying seven designs, we had to
verify only three, and even these three leveraged submodules
which had been unit-tested independently.
The six Pareto optimal designs we generated during
exploration provide some good intuition into the area-power
tradeoff possible in our design. To reduce the area of our
initial (combinational) design by 20% (the folded design),
we increase our power usage by 75%. This tradeoff becomes
more costly as we further reduce the design; if we wish to
reduce the size by 70%, we increase the power usage by 760%.
In the future we wish to apply this methodology to more
complex designs, such as the H.264 video decoder blocks.
Such designs would have many more critical design blocks,
and better emphasize the benefits of our methodology.

Acknowledgments: The authors would like to thank Nokia
Inc. for funding this research. The authors would also like
to thank Hadar Agam of Bluespec Inc. for her invaluable help
in getting power estimates using Sequence Design
PowerTheatre.

References

[1] Arvind, R. S. Nikhil, D. L. Rosenband, and N. Dave. High-level
    Synthesis: An Essential Ingredient for Designing Complex ASICs.
    In Proceedings of ICCAD'04, San Jose, CA, 2004.
[2] E. Czeck, R. Nanavati, and J. Stoy. Reliable Design with
    Multiple Clock Domains. In Proceedings of Formal Methods and
    Models for Codesign (MEMOCODE), 2006.
[3] J. C. Hoe and Arvind. Synthesis of Operation-Centric Hardware
    Descriptions. In Proceedings of ICCAD'00, pages 511–518, San
    Jose, CA, 2000.
[4] IEEE. IEEE Standard 802.11a Supplement. Wireless LAN Medium
    Access Control (MAC) and Physical Layer (PHY) Specifications,
    1999.
[5] K. Maharatna, E. Grass, and U. Jagdhold. A 64-Point Fourier
    Transform Chip for High-Speed Wireless LAN Application Using
    OFDM. IEEE Journal of Solid-State Circuits, 39(3), March 2004.
[6] D. L. Rosenband and Arvind. Modular Scheduling of Guarded
    Atomic Actions. In Proceedings of DAC'04, San Diego, CA, 2004.
[7] B. S. Son, B. G. Jo, M. H. Sunwoo, and Y. S. Kim. A High-Speed
    FFT Processor for OFDM Systems. In Proceedings of the IEEE
    International Symposium on Circuits and Systems, pages 281–284,
    2002.
[8] N. Zhang and R. W. Brodersen. Architectural Evaluation of
    Flexible Digital Signal Processing for Wireless Receivers. In
    Proceedings of the Asilomar Conference on Signals, Systems and
    Computers, pages 78–83, 2000.
