Very Low-Complexity Hardware Interleaver for Turbo Decoding
7, JULY 2007
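$$R = \begin{cases} 5, & \text{if } 40 \le K \le 159 \\ 10, & \text{if } 160 \le K \le 200 \text{ or } 481 \le K \le 530 \\ 20, & \text{other cases} \end{cases}$$

$$C = \begin{cases} p - 1, & \text{if } K \le R \times (p - 1) \\ p, & \text{if } R \times (p - 1) < K \le R \times p \\ p + 1, & \text{otherwise} \end{cases} \qquad (2)$$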
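As a reference point for the cost figures quoted in the next paragraph, the following minimal Python sketch computes $R$, $p$, and $C$ for a block size $K$ as defined by the standard [2], using the conventional binary search over the prime table. It assumes that the 52 candidate primes of [2, Table II] are exactly the primes from 7 to 257; the function names are our own, and the ROM-based mapping function proposed in this brief is not modeled.

```python
from math import isqrt


def is_prime(n: int) -> bool:
    """Trial-division primality test (adequate for n <= 257)."""
    return n >= 2 and all(n % d for d in range(2, isqrt(n) + 1))


# Assumption: the 52 candidate p values of [2, Table II] are the primes 7..257.
P_TABLE = [n for n in range(7, 258) if is_prime(n)]


def rows(K: int) -> int:
    """Number of rows R for block size K."""
    if not 40 <= K <= 5114:
        raise ValueError("K must be in 40..5114")
    if 40 <= K <= 159:
        return 5
    if 160 <= K <= 200 or 481 <= K <= 530:
        return 10
    return 20


def find_p(K: int, R: int) -> tuple[int, int]:
    """Return (p, probes): the minimum p in P_TABLE with K <= R * (p + 1),
    found by binary search, and the number of table probes it took."""
    if 481 <= K <= 530:
        return 53, 0                      # fixed by the standard for this range
    lo, hi, probes = 0, len(P_TABLE), 0
    while lo < hi:                        # lower-bound binary search
        mid = (lo + hi) // 2
        probes += 1                       # one ROM read and one multiplication
        if K <= R * (P_TABLE[mid] + 1):
            hi = mid                      # P_TABLE[mid] is large enough; try smaller
        else:
            lo = mid + 1                  # P_TABLE[mid] is too small
    return P_TABLE[lo], probes


def columns(K: int, R: int, p: int) -> int:
    """Number of columns C, using the standard's three-case rule."""
    if 481 <= K <= 530:
        return p                          # the standard sets C = p in this range
    if K <= R * (p - 1):
        return p - 1
    if K <= R * p:
        return p
    return p + 1
```

For example, `find_p(5114, 20)` returns `(257, 6)` and `columns(5114, 20, 257)` returns 256. Over all $K$ from 40 to 5114, the search never needs more than 6 probes, which matches the six memory accesses and six multiplications mentioned in the text.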
Based on the computation result from the first step, if $481 \le K \le 530$, then $p = 53$. If this condition is not satisfied, we need to find the minimum prime number $p$ such that $K \le R \times (p + 1)$. A normal approach is to use binary search. As the total number of prime numbers to be considered is 52 (according to the WCDMA standard [2, Table II], $p$ has 52 possible values), we need to perform 6 multiplication operations, 6 memory accesses, and 12 addition/subtraction operations to determine the $p$ value in general.

In this brief, we consider an indirect computation approach. Assume we store all $p$ values in a table (implemented with a ROM, starting with address "0"). To address the table for the target $p$ value, we calculate an approximate index by using a simple mapping function. We construct this mapping function so that, for any $K$ and $R$, the real $p$ value is guaranteed to be stored in one of four table entries determined by the approximate index. The outcome of a first comparison decides which second comparison is performed, so after 2 clock cycles we determine the index of the target $p$. Thus, we can get the $p$ value and the $v$ value (the primitive root associated with the prime number $p$, see the standard [2, Table II]) if we store $p$ and the corresponding $v$ in the same entry. The mapping function used in the design is a piecewise-linear function, which can be implemented with only add-and-shift operations.

III. COMPUTATION OF $s$ ARRAY AND $q$ ARRAY

A. Computation of $s$ Array

The $s$ array is computed as follows with $s(0) = 1$:

$$s(j) = \big(v \times s(j-1)\big) \bmod p, \qquad j = 1, 2, \ldots, p - 2. \qquad (3)$$

Direct computation of the $s$ array will inevitably involve multiplications and modulo operations, which not only raise the hardware cost but also increase the computing delay.

Observe that $v$ only has 6 values: 2, 3, 5, 6, 7, and 19. We propose to gradually compute $s(j)$ from $s(j-1)$. Assume $0 \le a < p$ and $0 \le b < p$. We have the following:

$$(a + b) \bmod p = \begin{cases} a + b, & \text{if } a + b < p \\ a + b - p, & \text{otherwise.} \end{cases}$$

Since the result is either $a + b$ or $a + b - p$ depending on whether $a + b < p$ or not, a modular addition requires no division. By building $v \times s(j-1)$ from a short sequence of such modular additions, we can compute $s(j)$ according to (3) without multiplications or modulo operations. Fig. 2 shows the circuitry to compute $s(j)$ from $s(j-1)$ for any value of $v$.

The basic strategy of this design is to take a different number of cycles to compute a new value for different $v$, from one cycle in the simplest case up to five cycles. In the case of $v = 19$, it takes five cycles per iteration to compute $s(j)$ from $s(j-1)$ (i.e., 5 cycles per entry). During these five cycles, the registers D0 and D1 sequentially output the intermediate values (starting with $2 \times s(j-1)$ in D0) from which $s(j)$ is formed. Note that this circuit only requires 4 adders, 4 registers, and some simple switching/multiplexing elements. The selection signals of the two multiplexers can be generated from a small look-up table, as indicated in Table I.

B. Computation of $q$ Array

The $q$ array is computed as follows according to the standard [2]: with $q_0 = 1$, compute $q_i$ ($i = 1, 2, \ldots, R - 1$) as the minimum prime number such that $q_i > 6$, $q_i > q_{i-1}$, and $\mathrm{GCD}(q_i, p - 1) = 1$, where GCD stands for the greatest common divisor function.
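The two arrays defined above can be illustrated with the following Python sketch. For the $s$ array, it replaces each multiplication by $v$ with a generic double-and-add sequence of modular additions (each one addition plus at most one subtraction of $p$); this is an assumed stand-in that demonstrates the add-only principle, not the cycle schedule of Fig. 2. The $q$ array follows the standard's definition directly. The helper `void_indices` is hypothetical: it assumes the sequential prime array is 1 followed by the primes 7 through 89 and reports which of its entries are skipped before the last $q_i$.

```python
from math import gcd, isqrt


def is_prime(n: int) -> bool:
    return n >= 2 and all(n % d for d in range(2, isqrt(n) + 1))


def mod_add(a: int, b: int, p: int) -> int:
    """(a + b) mod p for 0 <= a, b < p: one addition, at most one subtraction."""
    t = a + b
    return t - p if t >= p else t


def mul_mod_small(v: int, x: int, p: int) -> int:
    """(v * x) mod p for 0 <= x < p using only mod_add (MSB-first double-and-add)."""
    acc = 0
    for bit in bin(v)[2:]:
        acc = mod_add(acc, acc, p)        # acc <- 2 * acc mod p
        if bit == "1":
            acc = mod_add(acc, x, p)      # acc <- acc + x mod p
    return acc


def s_array(p: int, v: int) -> list[int]:
    """s(0) = 1 and s(j) = (v * s(j - 1)) mod p for j = 1, ..., p - 2, as in (3)."""
    s = [1]
    for _ in range(p - 2):
        s.append(mul_mod_small(v, s[-1], p))
    return s


def q_array(p: int, R: int) -> list[int]:
    """q_0 = 1; each q_i is the minimum prime > 6 and > q_{i-1} with GCD(q_i, p - 1) = 1."""
    q, n = [1], 7
    while len(q) < R:
        if is_prime(n) and gcd(n, p - 1) == 1:
            q.append(n)
        n += 1
    return q


# Assumed sequential prime array: 1 followed by the primes 7..89 (22 entries).
SEQ = [1] + [n for n in range(7, 90) if is_prime(n)]


def void_indices(p: int, R: int) -> list[int]:
    """Hypothetical helper: indices of SEQ entries skipped before the last q_i."""
    q = q_array(p, R)
    return [i for i, n in enumerate(SEQ) if n <= q[-1] and n not in q]
```

For example, `s_array(7, 3)` gives `[1, 3, 2, 6, 4, 5]`, `void_indices(53, 10)` gives `[3]` (13 is skipped because it divides 52), and `void_indices(239, 20)` gives `[1, 4]` (7 and 17 are skipped because $238 = 2 \times 7 \times 17$).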
638 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS, VOL. 54, NO. 7, JULY 2007
Directly computing the GCD is a recursive process, and it is more complex than performing a few division operations. From simulation, we found that the $q$ array is a subset of a group of sequential prime numbers (i.e., $1, 7, 11, 13, 17, 19, \ldots$). In particular, for any value of $p$, the $q$ array is a subset of $R + 2$ sequential prime numbers (the first $R + 2$ entries of this sequence), where $R$ is the number of rows. As the $q$ array contains exactly $R$ elements, we need to record at most two void indices for each $p$. These void indices, which can be easily identified from simulations, are the indices of the entries of the sequential prime number array that do not belong to the $q$ array. For instance, when $p - 1$ is a multiple of 13 (e.g., $p = 53$), 13 is the fourth entry of the prime array but not an element of the $q$ array, so the void index is 3. From our simulation, we know that the maximum void index is 20. Therefore, we need 5 bits to store one void index. If there is no void index, we store 0. There is only one case that has two void indices: when $p = 239$, $p - 1 = 2 \times 7 \times 17$, so the two prime numbers 7 and 17 are not included in the $q$ array. Hence, the two void indices are 1 and 4. For this special case, we store 0 01 100 (12 in decimal) in the table. Since 12 is not a void index for any other value of $p$, we use it for this special case. We could store any other number that is not used in any case, but this setting leads to minimal hardware cost, since 0 01 100 can easily be split into 0 01 and 0 100 (i.e., 1 and 4).

As will be shown in later discussions (Section V-B), what we really care about is $q_i \bmod (p - 1)$ rather than $q_i$ itself. We introduce a Q ROM that has 22 entries and stores the sequential prime numbers, i.e., the first entry stores 1, the second entry stores 7, the third entry stores 11, and so on. We use the circuitry of Fig. 3 to recursively compute $Q[j] \bmod (p - 1)$ without introducing modulo operations. Note that the output of the circuit is dropped when a void index matches the running index $j$.

IV. STORAGE OF $p$, $v$, AND VOID INDICES OF $q$

From the above discussions, it is clear that we need to store the 52 sequential prime numbers $p$, their corresponding $v$ values, and the void index/indices of the corresponding $q$ array. Since $p \le 257$, $v \le 19$, and every void index is at most 20, a straightforward way needs $9 + 5 + 5 = 19$ bits per entry. However, some approaches can be taken to save storage space.

A. Storage of $p$ Values

The maximum prime number is 257, which requires 9 bits. Since every $p$ is odd, if we do not store the least significant bit (LSB), we need 8 bits to store each $p$ value. If we additionally subtract a constant offset before storing, only 7 bits are needed per entry; in this case, we need one addition and one shift operation to recover $p$. A more aggressive approach is to store only a small residual value in the table, from which $p$ is recovered together with $i$, where $i$ denotes the index of the $p$ value in the table. From simulation, we know that this residual takes values from 0 to 30. Thus, we only need 5 bits to store it for each $p$. We can recover $p$ from the residual and $i$ as follows:

(4)

The above computation involves three addition operations and one shift operation. In this design, we allocate 8 bits (by dropping the LSB) per entry to avoid the extra computation of recovering the real value of $p$.

B. Storage of $v$ Values

As $v$ has only 6 values (2, 3, 5, 6, 7, and 19), we use 3 bits to record the index of $v$, i.e., 0 for 2, 1 for 3, 3 for 5, 4 for 6, 6 for 7, and 7 for 19. We did not choose continuous indices because this selection is optimized for the computation of the $s$ array (refer to Section III-A). Specifically, the number of cycles per iteration in computing an $s$ array entry is directly related to the value of this index via a simple equation

(5)

where ">>" denotes the right-shift operation. For instance, we need 5 cycles to compute $s(j)$ from $s(j-1)$ when $v = 19$ (index 7), and fewer cycles for the other values of $v$.

As discussed in Section III, we need 5 bits to record the void index for each $p$. In all, we need $8 + 3 + 5 = 16$ bits per entry (13 bits per entry if we use 5 bits for each $p$), and we have 52 entries in the ROM.

V. ON-LINE COMPUTATION OF INTERLEAVE PATTERNS

What we discussed before is the computation of the important parameters (such as $p$, $v$, the $s$ array, and the $q$ array), which are used to compute the exact permuted address for each bit. We call the above process of calculating these parameters "pre-computation." In this section, we discuss the method to compute valid interleave addresses one by one, a process we call "on-line computation." In practice, we output one valid interleave address almost every cycle.

A. Change of the Permutation Order

According to the 3GPP standard, the on-line operation order is: 1) intra-row permutation, 2) inter-row permutation, 3) read out by column, and 4) prune invalid bits. However, for practical implementations, this order is not the most efficient one and introduces unnecessary hardware and computational complexity. In the following, we propose a method that is more efficient, with smaller hardware area and higher speed.

Suppose the input bit stream is $x_1, x_2, \ldots, x_K$. After inserting the dummy bits and writing row by row, it becomes

$$\begin{bmatrix} y_1 & y_2 & \cdots & y_C \\ y_{C+1} & y_{C+2} & \cdots & y_{2C} \\ \vdots & \vdots & \ddots & \vdots \\ y_{(R-1)C+1} & y_{(R-1)C+2} & \cdots & y_{RC} \end{bmatrix}$$

and we have the relation $y_k = x_k$ for $k = 1, 2, \ldots, K$. If $k > K$, then $y_k$ is a dummy bit.

For the intra-row permutation, we calculate the parameter $U_i(j)$, which is the original bit position of the $j$th permuted bit of the $i$th row, as

$$U_i(j) = s\big((j \times r_i) \bmod (p - 1)\big), \qquad j = 0, 1, \ldots, p - 2 \qquad (6)$$

where the $r$ array is defined as $r_{T(i)} = q_i$ ($i = 0, 1, \ldots, R - 1$) and $T$ is the inter-row permutation pattern pre-
WANG AND LI: VERY LOW-COMPLEXITY HARDWARE INTERLEAVER FOR TURBO DECODING 639
where , such an operation can be denoted as
loop from to
loop from to

Fig. 3. Computation of $Q[j] \bmod (p - 1)$.
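The Q ROM idea can be illustrated with the following Python sketch of one way to obtain $q_i \bmod (p - 1)$ without a modulo operation. It assumes the 22-entry ROM holds 1 followed by the primes 7 through 89; because consecutive entries differ by at most 6 and $p - 1 \ge 6$, adding the next difference to the running residue and subtracting $p - 1$ at most once keeps the residue exact. This gap-based update is our own illustration and may differ from the circuit in Fig. 3; `q_mod_stream` and `void_set` are hypothetical names.

```python
from math import gcd, isqrt


def is_prime(n: int) -> bool:
    return n >= 2 and all(n % d for d in range(2, isqrt(n) + 1))


# Assumed Q ROM contents: 1 followed by the primes 7..89 (22 entries).
Q_ROM = [1] + [n for n in range(7, 90) if is_prime(n)]
# Differences between consecutive entries; all are <= 6, and p - 1 >= 6 for p >= 7.
Q_GAPS = [b - a for a, b in zip(Q_ROM, Q_ROM[1:])]


def void_set(p: int) -> set[int]:
    """Q ROM indices whose entry shares a factor with p - 1 (never in the q array)."""
    return {j for j, n in enumerate(Q_ROM) if gcd(n, p - 1) != 1}


def q_mod_stream(p: int, R: int, void: set[int]) -> list[int]:
    """q_i mod (p - 1) for i = 0..R-1, using one add and at most one subtract per entry."""
    m = p - 1
    r = 1                          # Q[0] mod (p - 1); 1 < m because p >= 7
    out: list[int] = []
    for j in range(len(Q_ROM)):
        if j > 0:
            r += Q_GAPS[j - 1]     # r < m and gap <= m, so now r < 2m
            if r >= m:
                r -= m             # one conditional subtraction restores r < m
        if j in void:
            continue               # drop the output when j hits a void index
        out.append(r)
        if len(out) == R:
            break
    return out
```

For $p = 7$ and $R = 5$, `q_mod_stream(7, 5, void_set(7))` returns `[1, 1, 5, 1, 5]`, i.e., $1, 7, 11, 13, 17$ reduced modulo 6.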
TABLE III
CLOCK CYCLE COUNTS FOR DIFFERENT BLOCK SIZES