636 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS, VOL. 54, NO. 7, JULY 2007

Very Low-Complexity Hardware Interleaver for Turbo Decoding
Zhongfeng Wang, Senior Member, IEEE, and Qingwei Li

Abstract—This brief presents a very low complexity hardware


interleaver implementation for turbo code in wideband CDMA
(W-CDMA) systems. Algorithmic transformations are extensively
exploited to reduce the computation complexity and latency. Novel
VLSI architectures are developed. The hardware implementation
results show that an entire turbo interleave pattern generation
unit consumes only 4 k gates, which is an order of magnitude
smaller than conventional designs.
Index Terms—CDMA, interleaver, turbo codes, VLSI architecture.

Fig. 1. (a) Turbo encoder structure. (b) Serial turbo decoder structure.

Manuscript received November 29, 2006; revised January 30, 2007. Some information in this paper may be covered in a patent application. This paper was recommended by Associate Editor S. Tsukiyama.
Z. Wang was with Morphics Technology Inc., Campbell, CA 95008 USA. He is now with the School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR 97331 USA.
Q. Li is with the School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR 97331 USA.
Digital Object Identifier 10.1109/TCSII.2007.895313
1549-7747/$25.00 © 2007 IEEE

I. INTRODUCTION

Turbo code [1], invented in 1993, has been adopted in several industrial standards, such as third-generation CDMA systems [2], [3], due to its outstanding performance. Fig. 1 shows a turbo encoder structure and a serial turbo decoder structure. The encoder signals are the source information bit, the systematic bit, parity bit-1, and parity bit-2; the decoder inputs are the received soft symbols corresponding to the transmitted bits; RSC stands for recursive systematic convolutional encoder; and the soft-input soft-output (SISO) decoder outputs two soft messages at each time instance: the log-likelihood ratio and the extrinsic information. One of the key features of turbo code is the interleaver. At the encoder side, a block of information bits is interleaved and sent to RSC2 to generate parity bit-2. At the decoder side, the extrinsic information and symbols are interleaved at the second decoding phase [4]. In practical implementations, the interleave process is performed by reading data in the interleaved order. (Note that the de-interleaving process is completed by writing data back to where they were loaded from, so no de-interleave patterns are required.) Therefore, an interleave pattern generation circuit is needed, which serves as an address generator as shown in Fig. 1.

In WCDMA systems, the turbo code block size varies from 40 to 5114 bits. Different block sizes require different interleave patterns. It can be derived that a ROM-based solution requires more than 100 M bits of storage for all the interleave patterns, which is unacceptable from the hardware cost point of view. In [5], a hardware interleaver solution was proposed by researchers in the Cornell Broadband Communication Lab. The total hardware amounts to approximately 30 K gates. A processor-based solution presented in [6] used slightly more hardware while supporting turbo codes in CDMA2000 systems as well.

Approaches for low-power implementation of digital signal processing (DSP) systems have been addressed in many papers, such as [7]. In this brief, we maximally exploit joint algorithm-level, architecture-level, and circuit-level VLSI optimization approaches to eliminate costly multiplications, divisions, and modulo operations, in order to reduce the overall computation complexity, computing latency, and power consumption of the target system. Our implementation results show that the proposed design has an order of magnitude lower hardware complexity than other published designs.

This brief is organized as follows. In Section II, the approach to compute the basic parameters is described. In Section III, we introduce two new methods to compute the S and Q arrays, which are the two most complex parts of this interleaver. Then, in Section IV, we briefly present some ways to save storage space. In Section V, we propose to change the permutation order, which saves some computation hardware and removes the delay of the traditional permutation method. Section VI illustrates the VLSI design details and provides the implementation report. Finally, conclusions are drawn in Section VII. It should be mentioned that the idea of on-the-fly address generation by changing the permutation order, discussed in Section V, is similar to the approach proposed in [6], though the new work was independently developed.

Most of the variables used in the following discussion match the symbols used in the standard [2]. For the detailed definition of each variable, please refer to [2].

II. COMPUTATION OF BASIC PARAMETERS

A. Computation of R

$$R = \begin{cases} 5, & \text{if } 40 \le K \le 159 \\ 10, & \text{if } 160 \le K \le 200 \text{ or } 481 \le K \le 530 \\ 20, & \text{other} \end{cases} \tag{1}$$

The number of rows $R$ is computed using (1), where $K$ is the input parameter representing the block size. For low complexity, we use three cycles to make three comparisons of $K$ against the range boundaries in (1), and each comparison produces one decision bit. Cycle-1 and Cycle-2 each make a fixed comparison. In Cycle-3, the comparison depends on the answer from Cycle-2: one boundary is checked if that answer is "yes", and a different boundary is checked otherwise. The final value of $R$ can be determined using simple combinational logic based on the three decision bits. To reduce the complexity of the forthcoming computation, we only record the index of $R$: 0 for $R = 5$, 1 for $R = 10$, and 2 for $R = 20$. Thus, we only need a logic block with a 3-bit input and a 2-bit output to determine the $R$ index.
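The three-comparison scheme can be sketched in software as follows. This is only an illustration: the function name `select_rows`, the bit names `b1` to `b3`, and the particular choice of thresholds are assumptions that are merely consistent with the ranges in (1), and the hardware may schedule its comparisons differently.

```python
def select_rows(K):
    """Return (R_index, R) for block size K using three comparisons.

    The thresholds are an assumed schedule consistent with (1); they are
    not taken from the circuit described in the text.
    """
    if not 40 <= K <= 5114:
        raise ValueError("block size K must lie in 40..5114")

    b1 = K >= 160                            # cycle 1: above the R = 5 range?
    b2 = K >= 481                            # cycle 2: candidate for 481..530?
    b3 = (K <= 530) if b2 else (K <= 200)    # cycle 3: boundary depends on b2

    # Small combinational decode: 3 decision bits in, 2-bit R index out.
    if not b1:
        r_index = 0
    elif b3:
        r_index = 1
    else:
        r_index = 2
    return r_index, (5, 10, 20)[r_index]


for K in (40, 160, 201, 481, 531):
    print(K, select_rows(K))
```

Returning the 2-bit index alongside $R$ mirrors the text, where only the index is recorded for later stages.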
B. Calculation of p and C

The second step is to determine the prime number $p$ and the number of columns $C$:

$$p = \begin{cases} 53, & \text{if } 481 \le K \le 530 \\ \min\{p \text{ prime} : K \le R\,(p+1)\}, & \text{other cases} \end{cases} \qquad C = \begin{cases} p-1, & \text{if } K \le R\,(p-1) \\ p, & \text{if } R\,(p-1) < K \le R\,p \\ p+1, & \text{otherwise} \end{cases} \tag{2}$$

Based on the computation result from the first step, if $R = 10$ and $K \ge 481$, then $p = 53$ and $C = p$; the three-way choice of $C$ in (2) applies to the other cases. If this condition is not satisfied, we need to find the minimum prime number $p$ such that $K \le R\,(p+1)$. A normal approach is to use binary search. As the total number of prime numbers to be considered is 52 (according to the WCDMA standard [2, Table II], $p$ has 52 possible values), we need to perform 6 multiplication operations, 6 memory accesses, and 12 addition/subtraction operations to determine the $p$ value in general.

In this brief, we consider an indirect computation approach. Assume we store all $p$ values in a table (implemented with a ROM, starting with address "0"). To address the table for the target $p$ value, we calculate an approximate index by using a simple mapping function. Here, we construct the mapping function such that it guarantees the real $p$ value to be stored in one of four table entries addressed relative to the approximate index, for any $K$ and $R$. Two comparisons against stored entries, one per clock cycle, then select among the four candidates, so after 2 clock cycles we have determined the index of the target $p$. Thus, we can get the $p$ value and the $v$ value (the primitive root associated with the prime number $p$, see the standard [2, Table II]) if we store $p$ and the corresponding $v$ in the same entry. The mapping function used in the design is a piecewise-linear function, which can be implemented with only add-and-shift operations.

III. COMPUTATION OF S ARRAY AND Q ARRAY

A. Computation of S Array

The S array is computed as follows with $S(0) = 1$:

$$S(j) = \big(v \times S(j-1)\big) \bmod p, \quad j = 1, 2, \ldots, p-2. \tag{3}$$

Direct computation of the S array inevitably involves multiplications and modulo operations, which not only raise the hardware cost but also increase the computing delay.

Observe that $v$ has only 6 values: 2, 3, 5, 6, 7, and 19. We propose to gradually compute $S(j)$ from $S(j-1)$. The key observation is that, for operands already reduced modulo $p$, a doubling or an addition modulo $p$ needs at most one subtraction. For $0 \le a, b < p$, we have the following:

$$(a + b) \bmod p = \begin{cases} a + b, & \text{if } a + b < p \\ a + b - p, & \text{otherwise.} \end{cases}$$

Since every product $v \times S(j-1)$ with these small values of $v$ can be built from a few such doublings and additions, we can compute $S(j)$ using (3) without a multiplier or a divider. Fig. 2 shows the circuitry to compute $S(j)$ from $S(j-1)$ for any value of $v$.

Fig. 2. Circuitry for computation of S array.

The basic strategy of this design is to take a different number of cycles to compute a new value for different $v$. Specifically, the cycle count ranges from one cycle in the simplest case to five cycles for $v = 19$. In the case of $v = 19$, it takes five cycles per iteration to compute $S(j)$ from $S(j-1)$ (i.e., 5 cycles per entry). During these five cycles, register D0 sequentially outputs a series of intermediate values beginning with $2 \times S(j-1) \bmod p$, and register D1 sequentially outputs a second series of intermediate values. Note that this circuit only requires 4 adders, 4 registers, and some simple switching/multiplexing elements. The selection signals of the multiplexers can be generated from a small look-up table as indicated in Table I.

TABLE I
MUX SELECT VALUES AT DIFFERENT CLOCK CYCLES

B. Computation of Q Array

The Q array is computed as follows according to the standard [2]: with $q_0 = 1$, compute $q_i$, $i = 1, 2, \ldots, R-1$, as the least value such that $q_i > 6$ and $q_i$ is a prime number, $q_i > q_{i-1}$, and $\mathrm{GCD}(q_i, p-1) = 1$, where GCD stands for the greatest common divisor function.
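As a software cross-check of the two definitions above, the following minimal sketch computes the S array of (3) using only additions and conditional subtractions, and computes the Q array directly from its GCD-based definition. It is not a model of the circuit in Fig. 2: the double-and-add schedule, the helper names (`mod_add`, `times_v_mod_p`, `s_array`, `is_prime`, `q_array`), and the example values $p = 7$, $v = 3$, $R = 5$ are all assumptions chosen for illustration.

```python
from math import gcd


def mod_add(a, b, p):
    """Return (a + b) mod p for 0 <= a, b < p with one conditional subtraction."""
    t = a + b
    return t - p if t >= p else t


def times_v_mod_p(s, v, p):
    """Return (v * s) mod p by double-and-add; no multiplier or divider is used."""
    acc = 0
    for bit in bin(v)[2:]:             # binary digits of v, most significant first
        acc = mod_add(acc, acc, p)     # doubling step
        if bit == "1":
            acc = mod_add(acc, s, p)   # conditional add of s
    return acc


def s_array(p, v):
    """S(0) = 1 and S(j) = v * S(j-1) mod p for j = 1 .. p-2, as in (3)."""
    s = [1]
    for _ in range(p - 2):
        s.append(times_v_mod_p(s[-1], v, p))
    return s


def is_prime(n):
    """Trial-division primality test, adequate for the small values used here."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True


def q_array(p, R):
    """q_0 = 1; q_i is the least prime > 6 and > q_(i-1) with GCD(q_i, p-1) = 1."""
    q = [1]
    candidate = 7
    while len(q) < R:
        if is_prime(candidate) and gcd(candidate, p - 1) == 1:
            q.append(candidate)
        candidate += 1
    return q


print(s_array(7, 3))   # [1, 3, 2, 6, 4, 5]
print(q_array(7, 5))   # [1, 7, 11, 13, 17]
```

The direct `q_array` routine needs a primality test and a GCD per candidate, which is the cost that a table-based approach avoids.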

Directly computing GCD is a recursive process, and it is more complex than performing a few division operations. From simulation, we found that the Q array is always a subset of a fixed list of sequential prime numbers that begins with 1 and continues with the primes greater than 6 (7, 11, 13, and so on). In particular, for any value of $p$, the Q array is a subset of the first $R + 2$ entries of this list, where $R$ is the number of rows corresponding to the given $p$ value. As the Q array contains exactly $R$ elements, we need to record at most two void indices for each $p$. These void indices, which can be easily identified from simulations, are the indices of the entries of the sequential prime number array that do not belong to the Q array. For instance, when $p - 1$ is a multiple of 13 (as for $p = 53$), 13 is the fourth entry of the prime array but not an element of the Q array, so the void index is 3. From our simulation, we know the maximum void index is 20. Therefore, we need 5 bits to store one void index. If there is no void index, we store 0. There is only one case that has two void indices. When $p = 239$, $p - 1 = 238 = 2 \times 7 \times 17$; two prime numbers, 7 and 17, are not included in the Q array. Hence, the two void indices are 1 and 4. For this special case, we store 0 01 100 (12 in decimal) in the table. Since we know 12 is not a void index for any other value of $p$, we use it for this special case. We could store another number that is not used in any case, but the proposed setting leads to minimal hardware cost since we can easily split 0 01 100 into 0 01 and 0 100.

As will be shown in later discussions (Section V-B), what we really care about is $q_i \bmod (p-1)$ instead of $q_i$ itself. We introduce a Q ROM that has 22 entries and stores the differences between consecutive entries of the sequential prime number array, i.e., the first entry stores 1, the second entry stores $7 - 1 = 6$, the third entry stores $11 - 7 = 4$, and so on. We use the circuitry of Fig. 3 to recursively compute $q_i \bmod (p-1)$ without introducing modulo operations. It should be noted that the output of the circuit is dropped when a void index matches the running index.

IV. STORAGE OF p, v AND VOID INDICES OF Q

From the above discussions, it is clear that we need to store 52 sequential prime numbers $p$, their corresponding $v$ values, as well as the void index/indices for the corresponding Q array. Since $p \le 257$ requires 9 bits, $v \le 19$ requires 5 bits, and a void index requires 5 bits, a straightforward way needs 19 bits per entry. However, some approaches can be taken to save storage space.

A. Storage of p Values

The maximum prime number is 257, which requires 9 bits. If we do not store the least significant bit, we need 8 bits to store each $p$ value. If we store a suitably offset and halved value of $p$ in the table, only 7 bits are needed per entry. In this case, we need one addition and one shift operation to recover $p$. A more aggressive approach is to store a small difference value that is defined relative to the index of the $p$ value in the table. From simulation, we know this value ranges from 0 to 30. Thus, we only need 5 bits to store it for each $p$. We can recover $p$ from the stored value and the table index using (4), which involves three addition operations and one shift operation. In this design, we allocate 8 bits (by dropping the LSB) per entry to save further computation in recovering the real value of $p$.

B. Storage of v Values

As $v$ has only 6 values (2, 3, 5, 6, 7, and 19), we use 3 bits to record the index of $v$, i.e., 0 for 2, 1 for 3, 3 for 5, 4 for 6, 6 for 7, and 7 for 19. We did not choose contiguous indices because our selection is optimized for the computation of the S array (refer to Section III-A). Specifically, the number of cycles per iteration in computing an S array entry is now directly related to the value of the $v$ index via a simple equation (5) built on the right shift operation ">>". For instance, we need 5 cycles to compute $S(j)$ from $S(j-1)$ when $v = 19$.

As discussed in Section III, we need 5 bits to record the void index for each $p$. In all, we need $8 + 3 + 5 = 16$ bits per entry (if we use 5 bits for each $p$, then 13 bits per entry), and we have 52 entries for the ROM.

V. ON-LINE COMPUTATION OF INTERLEAVE PATTERNS

What we discussed before is the computation of the important parameters ($R$, $C$, $p$, $v$, the S array, and the Q array), which are used to compute the exact permuted address for each bit. We call the process of calculating these parameters "pre-computation." In this section, we discuss the method to compute valid interleave addresses one by one. We call this process "on-line computation." In practice, we output one valid interleave address almost every cycle.

A. Change of the Permutation Order

According to the 3GPP standard, the on-line operation order is: 1) intra-row permutation, 2) inter-row permutation, 3) read out by column, and 4) prune invalid bits. However, for practical implementations, this order is not the most efficient and introduces unnecessary hardware and computational complexity. In the following discussion, we propose a method that is more efficient, with smaller hardware area and higher speed.

Suppose the input bit stream is $x_0, x_1, \ldots, x_{K-1}$. After inserting the dummy bits and writing by row, it becomes

$$\begin{bmatrix} y_{0,0} & y_{0,1} & \cdots & y_{0,C-1} \\ \vdots & \vdots & \ddots & \vdots \\ y_{R-1,0} & y_{R-1,1} & \cdots & y_{R-1,C-1} \end{bmatrix}$$

and we have the relation $y_{i,j} = x_{iC+j}$. If $iC + j \ge K$, then $y_{i,j}$ is a dummy bit.

The intra-row permutation calculates the parameter $U_i(j)$, which is the original bit position of the $j$th permuted bit of the $i$th row, as

$$U_i(j) = S\big((j \times r_i) \bmod (p-1)\big) \tag{6}$$

where the $r$ array is defined as $r_{T(i)} = q_i$ and $T$ is the inter-row permutation pattern predefined according to different $K$ values [2]. After the intra-row permutation, the pattern becomes

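$$\begin{bmatrix} y'_{0,0} & y'_{0,1} & \cdots & y'_{0,C-1} \\ \vdots & \vdots & \ddots & \vdots \\ y'_{R-1,0} & y'_{R-1,1} & \cdots & y'_{R-1,C-1} \end{bmatrix}$$

where $y'_{i,j} = y_{i,U_i(j)}$ and $y_{i,j}$ is the entry in row $i$ and column $j$ of the matrix written by row. This operation can be denoted as

    loop i from 0 to R-1
        loop j from 0 to C-1
            y'(i, j) = y(i, U_i(j))

The inter-row permutation permutes the rows according to $T(i)$, where $T(i)$ is the original position of the $i$th permuted row. After the inter-row permutation, the pattern becomes

$$\begin{bmatrix} y''_{0,0} & y''_{0,1} & \cdots & y''_{0,C-1} \\ \vdots & \vdots & \ddots & \vdots \\ y''_{R-1,0} & y''_{R-1,1} & \cdots & y''_{R-1,C-1} \end{bmatrix}$$

and $y''_{i,j} = y'_{T(i),j}$, so the inter-row permutation can be expressed as

    loop i from 0 to R-1
        loop j from 0 to C-1
            y''(i, j) = y'(T(i), j)

Therefore, the intra-row and inter-row permutations can be combined as

    loop i from 0 to R-1
        loop j from 0 to C-1
            y''(i, j) = y(T(i), U_{T(i)}(j))

After the permutations, the sequence is output by column: $y''_{0,0}, y''_{1,0}, \ldots, y''_{R-1,0}, y''_{0,1}, \ldots, y''_{R-1,C-1}$. A straightforward method is to calculate the $y''$ matrix first, store the values, and then read it out by column. However, a wiser way is to do the permutation using $j$ as the outer loop and $i$ as the inner loop. Then the calculation sequence of $y''$ is the same as the output sequence. In this way, we can calculate an address, check whether it is valid (not a dummy bit), and output it. Thereby, no extra storage space is needed for $y''$ and no latency is introduced here.

Our method can be denoted as

    loop j from 0 to C-1
        loop i from 0 to R-1
            compute U_{T(i)}(j)
            address = T(i) * C + U_{T(i)}(j)
            if address < K, output address

We used this faster method in our implementations. Another point worth noticing is that, after the change of permutation order, the calculation of $U_i(j)$ becomes computing $U_{T(i)}(j)$. According to (6), we have

$$U_{T(i)}(j) = S\big((j \times r_{T(i)}) \bmod (p-1)\big) = S\big((j \times q_i) \bmod (p-1)\big). \tag{7}$$

Therefore, another benefit of our transformation is that we avoid the step of computing the $r$ array, which saves both the computation time and the RAM needed to store the $r$ values. It should be mentioned that the processor-based turbo interleaver design in [6] changed the permutation order in a similar way to reduce complexity.

B. Computation of Index to S Array

Notice in (7) that, to compute $U_{T(i)}(j)$, the value $(j \times q_i) \bmod (p-1)$ is needed; here $(j \times q_i) \bmod (p-1)$ denotes the S index. Fig. 4 demonstrates the circuitry we used to compute the S index. To avoid the multiplication, we recursively calculate $(j \times q_i) \bmod (p-1)$ from $((j-1) \times q_i) \bmod (p-1)$:

$$(j \times q_i) \bmod (p-1) = \Big[\big((j-1) \times q_i\big) \bmod (p-1) + q_i \bmod (p-1)\Big] \bmod (p-1). \tag{8}$$

Fig. 4. Computation of the index of S array.

Thus, from this equation, the values of the Q array need to be reduced modulo $(p-1)$ before they are saved and used. This is the reason why we design the circuit in Fig. 3 to compute $q_i \bmod (p-1)$. It should be pointed out that a similar incremental computation to (8) has been seen in [6]. However, this part of the work was independently developed by the first author.

Fig. 3. Computation of Q[j] mod (P − 1).

VI. VLSI IMPLEMENTATION

A. State Diagram

Please refer to Fig. 5.

B. Overall Block Diagram

The overall block diagram of the turbo interleaver address generator is shown in Fig. 6, where $K$ is the turbo code block size. The task begin signal indicates the start of computation, and the task kill signal forces the task to stop and return to the idle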
state. The "address" signal is the interleaved address output, and "address valid" indicates whether or not the output address comes from a dummy bit. "Index" stands for the address of the input bit being calculated. For example, one (index, address) pair denotes that the first bit will be interleaved to the 34th position. Three RAM blocks, including the S array RAM, store their corresponding arrays. T ROM stores the inter-row permutation pattern $T$. pvQ ROM stores the values of $p$, $v$, and the void index of the Q array. Q ROM stores the 22 differential prime number sequences. The Computing Core contains the main finite state machine and other combinational and sequential logic to conduct control and computation.

Fig. 5. State graph for the interleave address generator.

Fig. 6. Overall block diagram of the interleave address generator.

C. Implementation Results

Please refer to Tables II and III. The architectures discussed above have been modeled using the Verilog hardware description language. We have simulated and verified the design logic by comparing the output results to those of a C program. We have performed synthesis, optimization, and place & route. The synthesis was targeted at SMIC 0.18-μm standard CMOS technology. The optimization goal was set as area. The total hardware cost is approximately 4.03 k gates. The maximum clock frequency is 130 MHz. The required clock frequency, computed under the assumption that six iterations are performed for turbo decoding, is well below this maximum. This means that the real critical path in our design is significantly shorter than required. Thus, we can use a significantly lower supply voltage to drive the circuit in order to quadratically reduce the power consumption. In brief, compared to the designs presented in [5] and [6], the proposed design is an order of magnitude more efficient in both area and power.

TABLE II
AREA COMPARISON OF INTERLEAVER IMPLEMENTATIONS

TABLE III
CLOCK CYCLE COUNTS FOR DIFFERENT BLOCK SIZES

VII. CONCLUSION

In this brief, we have presented a novel hardware interleaver architecture and implementation for 3G WCDMA systems. Various optimization techniques, specifically judicious algorithmic transformations and novel VLSI architectures, have been introduced and applied to this design. The implementation results demonstrate the benefits of these techniques and show that this design is an order of magnitude more efficient than the prior art.

REFERENCES

[1] C. Berrou, A. Glavieux, and P. Thitimajshima, "Near Shannon limit error-correcting coding and decoding: Turbo-codes," in Proc. IEEE ICC'93, May 1993, vol. 2, pp. 1064–1070.
[2] Technical Specification Group Radio Access Network; Multiplexing and Channel Coding (FDD), 3GPP TS25.212 v5.1.0, 3rd Generation Partnership Project, 2002.
[3] Physical Layer Standard for CDMA2000 Spread Spectrum Systems, 3GPP2 C.S0002-C, v1.0, 3rd Generation Partnership Project 2, 2002.
[4] Z. Wang, H. Suzuki, and K. Parhi, "VLSI implementation issues of turbo decoder design for wireless applications," in Proc. IEEE Workshop Signal Process. Syst. (SiPS), 1999, pp. 503–512.
[5] P. Ampadu and K. Kornegay, "An efficient hardware interleaver for 3G turbo decoding," in Proc. RAWCON'03, Aug. 2003, pp. 199–201.
[6] M. Shin and I.-C. Park, "Processor-based turbo interleaver for multiple third-generation wireless standards," IEEE Commun. Lett., vol. 7, no. 5, pp. 210–12, May 2003.
[7] K. Parhi, "Approaches to low-power implementations of DSP systems," IEEE Trans. Circuits Syst. I, Fundam. Theory Appl., vol. 48, no. 10, pp. 1214–1224, Oct. 2001.
