
748 IEEE TRANSACTIONS ON MAGNETICS, VOL. 37, NO. 2, MARCH 2001

VLSI Architectures for Iterative Decoders in Magnetic Recording Channels

Engling Yeo, Student Member, IEEE, Payam Pakzad, Borivoje Nikolić, Member, IEEE, and Venkat Anantharam, Fellow, IEEE

Abstract—VLSI implementation complexities of soft-input soft-output (SISO) decoders are discussed. These decoders are used in iterative algorithms based on Turbo codes or Low Density Parity Check (LDPC) codes, and promise significant bit error performance advantage over conventionally used partial-response maximum likelihood (PRML) systems, at the expense of increased complexity. This paper analyzes the requirements for computational hardware and memory, and provides suggestions for reduced-complexity decoding and reduced control logic. Serial concatenation of interleaved codes, using an outer block code with a partial response channel acting as an inner encoder, is of special interest for magnetic storage applications.

Index Terms—Iterative decoders, LDPC codes, magnetic recording, turbo codes, VLSI architectures.

Fig. 1. Serially concatenated turbo encoder with a convolutional outer code.
Fig. 2. Iterative decoder using SISO decoders separated by interleavers.

I. INTRODUCTION

A TURBO encoder using serial concatenation of a convolutional code or Low Density Parity Check (LDPC) code with a partial-response channel acting as the inner coder is shown in Fig. 1 [1]. The iterative decoder (Fig. 2) uses a combination of soft-input-soft-output (SISO) decoders separated by interleavers and their inverses (deinterleavers). We present SISO decoder implementations that employ either the MAP algorithm (BCJR) [2], the Soft Output Viterbi Algorithm (SOVA) [3], or the LDPC decoding algorithm [5].

All systems considered in this paper assume a partial response channel. The particular partial response target is not essential to the following discussion, and is used as an example because it presents a complexity equivalent to contemporary read channel detectors. The outer code is either a 16-state binary convolutional code or an LDPC code, implementing a rate 8/9 coding. As is common with most magnetic recording channels, the use of block codes and interleaver design is restricted to a sector size of 4096 user bits. The number of bits used to represent the log-likelihood ratios or messages is a tradeoff between the amount of hardware required and the BER performance of the iterative decoder. Earlier systems using 4 to 6-bit representations [6], [7] have reported good performance with respect to floating-point results.

In order to achieve desired throughputs (above 1 Gbps) that are in line with current trends in magnetic recording systems, a fully unrolled and pipelined architecture [6] is needed (Fig. 3). This results in a linear complexity increase with the number of iterations.

In the following sections, structures for the building blocks of an iterative decoder will be analyzed. Section II discusses the implementation of an interleaver and deinterleaver. Section III discusses a MAP decoder implementing the Windowed-BCJR algorithm, using a minimal number of Add–Compare–Select units and a highly regular memory access pattern. A realization of a SOVA decoder by a simple extension of the register exchange method is presented in Section IV. Section V discusses a pipelined LDPC decoder and proposes a message arrangement in memory that lowers the complexity for address decoding. Section VI compares the results and Section VII provides some concluding remarks.

Fig. 3. Pipelined decoder for serially concatenated turbo codes using outer decoder D1 and inner decoder D2 separated by interleavers/deinterleavers.

Manuscript received June 20, 2000; revised October 11, 2000. The work of P. Pakzad and V. Anantharam was supported by the National Science Foundation through Awards IRI-97-12131 and SBR-9873086, and by ONR MURI through Award N00014-1-0637 on “Decision Making Under Uncertainty.”
The authors are with the Electrical Engineering and Computer Sciences Department, University of California, Berkeley, CA 94720 USA (e-mail: [email protected]).
Publisher Item Identifier S 0018-9464(01)02388-3.
0018–9464/01$10.00 © 2001 IEEE

Fig. 4. Interleavers and deinterleavers implemented using alternating read/write buffers.
Fig. 5. Add–Compare–Select unit for an iterator (either forward or backward) using the max*(.) operator as indicated within the box.

II. INTERLEAVER

The randomness of the interleaver output sequence makes it difficult to realize in-place storage. A direct interleaver implementation uses two banks of buffers alternating between read/write for consecutive sectors of data (Fig. 4). The latency through an interleaver is therefore equal to the block size.

The basic block interleaver design uses a minimal amount of control logic. Using static random-access memory (SRAM) for a high-speed implementation, the interleaver inputs are written row-wise into the memory array, while outputs are read column-wise. For a block interleaver of size N arranged as an R by C matrix, such that N = RC, this assures that bits located within a distance of C before interleaving are separated by a minimum distance of R after interleaving. The sequential write/read pattern along rows/columns allows the memory access operations of this interleaver to make use of cycle counters to activate both word (row) lines and bit (column) lines, thereby eliminating the necessity to perform memory-address decoding.

More sophisticated interleaver designs [8], [9] yield improved error rate performance, but result in increased implementation complexity. Therefore, the implementation of the described basic interleaver provides a lower limit on complexity.
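The row-write/column-read behavior described above can be modeled in a few lines. This is an illustrative software sketch, not the hardware design itself; the 3 x 4 block size is ours, chosen only to keep the example readable:

```python
def block_interleave(bits, rows, cols):
    """Write row-wise into a rows x cols array, then read column-wise."""
    assert len(bits) == rows * cols
    return [bits[r * cols + c] for c in range(cols) for r in range(rows)]

def block_deinterleave(bits, rows, cols):
    # Reading column-wise from a row-wise array is undone by
    # interleaving again with the dimensions swapped.
    return block_interleave(bits, cols, rows)

data = list(range(12))                      # a 3 x 4 example block
il = block_interleave(data, 3, 4)
assert block_deinterleave(il, 3, 4) == data
# Bits that were adjacent before interleaving end up `rows` apart:
assert abs(il.index(0) - il.index(1)) == 3
```

Because the read and write addresses advance strictly sequentially, a hardware version needs only cycle counters, matching the access pattern described above.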
III. MAP DECODER

A MAP decoder implements the BCJR [2] algorithm. It is used to obtain the a posteriori information for partial response channel decoding, as well as for outer decoding when a convolutional code is employed as the outer code. Given the prior probabilities P(u_k) and the channel likelihood estimates p(y_k | s', s), the log-domain computations of the BCJR algorithm are divided into three groups:

1) Branch metric computation for each branch between states s' and s:

    gamma_k(s', s) = log P(u_k) + log p(y_k | s', s)    (1)

2) Forward/backward iteration for each state s, assuming a radix-2 trellis. Forward state metric, over the two valid transitions (s', s):

    alpha_k(s) = max*_{s'} [ alpha_{k-1}(s') + gamma_k(s', s) ]    (2)

Backward state metric, over the two valid transitions (s, s''):

    beta_k(s) = max*_{s''} [ beta_{k+1}(s'') + gamma_{k+1}(s, s'') ]    (3)

3) Depending on the position (inner/outer) of the decoder, the required a posteriori probability is computed for either the information bits or the coded bits, respectively:

    L(u_k) = max*_{(s', s): u_k = 1} [ alpha_{k-1}(s') + gamma_k(s', s) + beta_k(s) ]
           − max*_{(s', s): u_k = 0} [ alpha_{k-1}(s') + gamma_k(s', s) + beta_k(s) ]    (4)

The structures for both forward and backward iterations are identical, and similar to the Add–Compare–Select units used in Viterbi decoders. Thus only the forward iterator (Fig. 5) will be described. The current branch metrics gamma_k are added to the corresponding state metrics alpha_{k-1} from the previous iteration:

    alpha_k(s) = log [ e^{alpha_{k-1}(s') + gamma_k(s', s)} + e^{alpha_{k-1}(s'') + gamma_k(s'', s)} ]    (5)

The logarithm of the sum of exponentials is then evaluated with a new operator, max*. It uses a comparator, a lookup table, and a final adder (Fig. 5) to approximate the second term in the equation [10]:

    max*(x, y) = max(x, y) + log(1 + e^{−|x − y|})    (6)

The forward/backward iteration structures are therefore termed Add–Compare–Select–Add (ACSA) units. A number of max* operators are also used in the computation of the a posteriori values, using the tree structure shown in Fig. 8.

To implement the original BCJR algorithm, the backward iteration can only begin after complete observation of the block of 4k bits, resulting in large memory requirements and long latencies. Variations of the BCJR algorithm avoid these effects by windowing or limiting the number of backward iteration steps.
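The ACSA recursion above can be modeled in software. The function names below are ours; a hardware realization replaces the logarithmic correction term with the small lookup table shown in Fig. 5:

```python
import math

def max_star(x, y):
    # max*(x, y) = log(e^x + e^y) = max(x, y) + log(1 + e^-|x - y|);
    # the correction term is what the lookup table approximates.
    return max(x, y) + math.log1p(math.exp(-abs(x - y)))

def acsa(alpha_prev_a, gamma_a, alpha_prev_b, gamma_b):
    """One Add-Compare-Select-Add step: two incoming branches, each
    adding a branch metric to its previous state metric, combined
    with max* as in (5)-(6)."""
    return max_star(alpha_prev_a + gamma_a, alpha_prev_b + gamma_b)

# max* computes the exact log-sum-exp:
assert abs(max_star(1.0, 2.0) - math.log(math.exp(1.0) + math.exp(2.0))) < 1e-9
```

The same unit serves the forward and backward recursions, since (2) and (3) differ only in the direction of traversal.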
A. Backward Propagation of Windowed BCJR
An implementation of windowed BCJR with asymptotically equivalent performance can be achieved using two overlapping windows for the beta-computation. Each window spans a fixed width and overlaps with the other window in both trellis position and time, as shown in Fig. 6. The initial outputs of each window are always discarded, while the latter outputs, having satisfied a criterion for a minimum number of steps through the trellis, are retained and eventually combined with the appropriate

Fig. 6. Backward iteration using 2 overlapping windows, W1 and W2, for the BCJR algorithm. The shaded outputs are always discarded.
Fig. 7. State-slice of a MAP decoder structure.
Fig. 8. The a posteriori computation block makes use of a binary tree of max*(.) operators.
Fig. 9. Memory read and write access of branch metrics.

values to obtain the soft outputs. This scheme results in a lower memory requirement and less computational hardware.

Fig. 7 shows a state-slice of the MAP decoder that is able to maintain a throughput equal to the input arrival rate. The gamma-memory stores the branch metrics. An alpha-ACSA performs the forward iteration and stores its outputs in the alpha-memory. Two beta-ACSA's perform the backward iteration in accordance with the overlapping-window method.

Fig. 9 reproduces the timing diagram of a scheme that limits the interval between the production of each branch metric and its three consumption cycles. The implementation partitions each gamma-memory block into 3 sections (the 3 sections of columns in Fig. 9) and deliberately delays the first forward iteration. New data is cyclically written into the partitions, while the write/read access pattern within each partition is continuously alternated between left-to-right and right-to-left directions. Each branch metric entry in memory is read once by each of the three ACSA's. After the third and final read access, the memory location is immediately replaced with new data. The repetitive nature of the memory access within each partition promotes a reduction in control logic compared to random access memory, and the gamma-memory is implemented as a bi-directional shift register.

Similarly, observations on the production and consumption patterns of the alpha values indicate that each alpha-memory block can be implemented with a bi-directional shift register.

Finally, evaluation of (4), the a posteriori result, is performed by summing the alpha and beta values in a tree structure (Fig. 8), and a final adder evaluates the log-likelihood ratios.

Although the maximum latency through each MAP decoder is nontrivial (80 cycles for the example parameters), it remains insignificant compared with that of the interleaver discussed in Section II.
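A software model of one backward window may help clarify the discard/retain schedule. The trellis connectivity and the uniform initialization below are placeholders (a real decoder uses the code's actual trellis); only the warm-up/retain behavior is the point of the sketch:

```python
def backward_window(gammas, warmup, num_states, max_star):
    """Run a backward recursion over one window of branch metrics
    and keep only the outputs produced after `warmup` steps,
    discarding the unreliable warm-up outputs."""
    beta = [0.0] * num_states            # uninformed initialization
    retained = []
    for t, gamma in enumerate(reversed(gammas)):
        # Placeholder radix-2 connectivity: state s is reached from
        # states s and (s + 1) mod num_states.
        beta = [max_star(beta[s] + gamma[s][0],
                         beta[(s + 1) % num_states] + gamma[s][1])
                for s in range(num_states)]
        if t >= warmup:                  # initial outputs are discarded
            retained.append(beta)
    return retained
```

With two such windows offset by half a window width, the retained halves tile the whole block, matching the overlap shown in Fig. 6.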
IV. SOFT OUTPUT VITERBI ALGORITHM (SOVA) DECODER

The computational complexity of the BCJR algorithm can be traded for reduced BER performance by replacing the MAP decoders with SOVA decoders [3].

As in BCJR, a windowed SOVA is advantageous in terms of its memory requirement and latency when compared to the full SOVA. Previous windowed SOVA implementations [12] made use of a two-step algorithm (Fig. 10). The first stage is a regular

Fig. 10. Realization of a SOVA decoder by cascading a typical VA survival memory unit with a SOVA section that traces the two most likely paths arriving at state m.
Fig. 11. Example 4-state system block for SOVA.
Fig. 12. Example 4-state SOVA register-exchange survival memory unit with Compare-and-Mux (CAM) units to perform equivalence checking and multiplexing. The outputs EQ indicate equality of the decisions taken at step i, state j, and traceback depth k.
Fig. 13. Block diagram of the Compare-and-Mux (CAM) unit, comprising an XOR gate for equality checking and a multiplexing function.

windowed-Viterbi algorithm (VA) that obtains the most likely state with a fixed delay. This is followed by another windowed traceback to find the two most likely paths arriving at that state. Tracebacks are performed by recursively reading intermediate decisions that were stored in an SRAM.

The SRAM-based traceback has a costly implementation complexity due to address decoder and sense amplifier overhead. An implementation of SOVA combining the efficiency of a register exchange pipeline with the two-stage SOVA is presented in Fig. 11.

From each of the ACS's, the difference between the two path metrics arriving at a given time and state, Delta, is retained. Additionally, a modified register exchange (Fig. 12) provides EQ outputs indicating the equality between the competing decisions at each time and state, from which a traceback of fixed depth is initiated. Using the decisions from the VA-SMU, the Delta's and EQ's corresponding to the most likely state are multiplexed into the Reliability Measure Unit (RMU), which uses comparators (minimizing function) and multiplexers in a pipeline to select the minimum Delta along the most-likely path.

The pipeline is initialized with the maximum reliability measure allowed by the particular binary representation (conceptually represented as infinity in Fig. 11). Based on the EQ input, each pipelined section outputs one of the following:
1) Equal decision—the reliability measure from the previous step.
2) Different decision—the minimum of Delta and the reliability measure from the previous step.

Compared with a hard-output Viterbi decoder implementation, the total size of the SMU's is approximately doubled (assuming the difference between the two window lengths is small). The RMU overhead consists of copies of 1 register, 2 multiplexers, and a 2-input comparator performing the minimization function.

The latency through the SOVA decoder remains insignificant compared with the overall latency in the Turbo-SOVA system, which is dominated by the latency through the interleavers.
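The two cases enumerated above amount to a running minimization along the most likely path. A sketch of the update rule, with the helper name being ours:

```python
INF = float("inf")

def rmu_update(reliability_prev, delta, decisions_equal):
    """One pipelined RMU section: keep the previous reliability when
    the competing decision agrees, otherwise clip it to the path
    metric difference Delta."""
    if decisions_equal:                      # 1) equal decision
        return reliability_prev
    return min(delta, reliability_prev)      # 2) different decision

# The pipeline starts at the maximum representable reliability and
# shrinks it whenever a disagreement with a small Delta is seen.
r = INF
for delta, eq in [(3.5, True), (1.2, False), (2.0, False), (0.8, True)]:
    r = rmu_update(r, delta, eq)
assert r == 1.2   # smallest Delta observed at a disagreement
```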
V. LOW DENSITY PARITY CHECK CODE DECODERS [4]

An LDPC code with a 512 x 4608 parity check matrix will be used as an example outer code. This parity check matrix has columns of weight 4 and rows of weight 36, and comprises a total of 512 x 36 = 18 432 nonzero entries. The parity check information is also commonly represented as a bipartite graph with 512 check nodes and 4608 bit nodes.

The following log-domain equations, modified from [5], exploit the large number of common terms in each group of computations.

Fig. 14. Recursive pipelined implementation to compute check-to-bit messages.
Fig. 15. Binary adder tree to compute bit-to-check messages in a 2-stage pipeline.

1) Check-to-bit messaging (parity check): the message R_mn from check node m to bit node n is

    R_mn = [ prod_{n' in N(m)\n} sgn(Q_mn') ] * f( sum_{n' in N(m)\n} f(|Q_mn'|) )    (7)

where f and its inverse are evaluated using lookup-tables:

    f(x) = log( (e^x + 1) / (e^x − 1) )    (8)

A simple expansion of these terms will show that f(f(x)) = x, implying that the implemented lookup-tables are identical.

2) Bit-to-check messaging:

    Q_mn = lambda_n + sum_{m' in M(n)\m} R_m'n    (9)

where lambda_n is the prior information for bit n.
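The two message computations can be sketched directly from (7)-(9). The function names and the set-exclusion bookkeeping below are our notation; a hardware implementation replaces f with the shared lookup table noted above:

```python
import math

def f(x):
    # f(x) = log((e^x + 1) / (e^x - 1)); since f(f(x)) = x, the same
    # lookup table serves both applications in (7).
    return math.log((math.exp(x) + 1.0) / (math.exp(x) - 1.0))

def sgn(x):
    return -1.0 if x < 0 else 1.0

def check_to_bit(q_msgs, n):
    """R message from a check node to bit n, per (7): product of
    signs times f of the sum of f's, each excluding bit n itself."""
    others = [q for i, q in enumerate(q_msgs) if i != n]
    sign = 1.0
    for q in others:
        sign *= sgn(q)
    return sign * f(sum(f(abs(q)) for q in others))

def bit_to_check(prior, r_msgs, m):
    """Q message from a bit node to check m, per (9): prior plus the
    incoming R messages from the other checks."""
    return prior + sum(r for i, r in enumerate(r_msgs) if i != m)

assert abs(f(f(2.0)) - 2.0) < 1e-9          # the two LUTs are identical
```

Note that the sign product and the magnitude summation in `check_to_bit` are exactly the two datapaths discussed in Section V-A: an XOR of sign bits and a many-input adder.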
A. Check-to-Bit Message Computation

The example LDPC block code has a total of 512 parity checks, where each parity check computes messages using entries from 36 bit nodes. The bottleneck in (7) is the 36-input summation.

In order to maintain a high throughput with a small number of read ports, the computation is performed with several copies of structures identical to Fig. 14, cascaded in parallel to achieve a throughput of several messages per cycle. A natural choice for the degree of parallelism would be the column weight (4 in our example), such that all check-to-bit messages are computed in approximately the same time it would have taken to acquire a new block of inputs.

Using 2's complement representation, the most-significant bit (MSB) of the messages is a sign bit. Therefore, the sign product in (7) is equal to a collective XOR of the MSB of all inputs. The result is fed into the output LUT to direct an output with the appropriate sign. In addition, the final lookup table could be precoded to account for the deterministic term in (7).

B. Bit-to-Check Message Computation

Each of the 4608 bit nodes in the example LDPC decoder computes messages using entries from 4 different check nodes. The bottleneck is the 4-input summation in (9). Unlike the earlier 36-input summation, the small number of inputs makes it very suitable for a pipelined tree adder structure, as shown in Fig. 15. With a steady-state throughput of four messages per cycle, the total latency to compute all the messages is again approximately the same time it would have taken to acquire a new block of inputs.

C. LDPC Memory Design

While the computational complexity of an LDPC decoder is very low compared with the MAP or SOVA decoders, the memory requirement far exceeds those of the latter two. Due to the irregularity in the parity check matrix, the two classes of computations over a single block of inputs, bit-to-check and check-to-bit, cannot be overlapped. In order to achieve fully pipelined throughputs, each memory block in the LDPC decoder is implemented as two buffers alternating between read/write. Thus, for a single iteration of LDPC decoding (bit-to-check and vice versa), the required memory is upwards of 73 000 words. This section provides a proposal for structural indexing of the messages to simplify the control logic.

The R messages are indexed in a 2-D array with column indices ordered sequentially along each row and strictly increasing row indices down the array. Fig. 17 shows an example matrix of R messages. In general, the messages are not consumed in any particular order along the column indices. With the described arrangement, though, entries along each row are consumed in a strictly left-to-right manner. Thus each 36-entry row of the R matrix is stored in a first-in-first-out (FIFO) buffer, which also removes the requirement for the computation block to keep track of the column index. Inputs are simply indexed by their row numbers, and

Fig. 16. Using FIFO's to store rows of R messages.
Fig. 17. Example 512 x 36 memory array for R values.
Fig. 18. Example 4608 x 4 memory array for Q values.
Fig. 19. Using 4-input stacks to store rows of Q messages.

each read port can therefore be implemented as a 512-input multiplexer.

Fig. 16 shows that each of the 4 parity check blocks outputs to a quarter of all the FIFO's. The demultiplexer select is incremented once every 36 cycles to switch to the next FIFO, which stores the next row in the R matrix.
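The FIFO arrangement of Fig. 16 can be modeled with a few lines. The sizes here are illustrative, not the 512 x 36 array of the example code:

```python
from collections import deque

ROWS = 4                                 # illustrative; 512 in the example code
fifos = [deque() for _ in range(ROWS)]

def store_row(r, messages):
    # The demultiplexer select chooses FIFO r for a whole row.
    fifos[r].extend(messages)

def next_message(r):
    # Consumers name only the row; entries pop strictly left-to-right,
    # so no column index ever needs to be tracked or decoded.
    return fifos[r].popleft()

store_row(2, [10, 11, 12])
assert next_message(2) == 10
assert next_message(2) == 11
```

This is the property that replaces address decoding with simple row selection: the FIFO discipline itself enforces the left-to-right consumption order.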
Similarly, Q is indexed as shown in Fig. 18, with indices ordered sequentially in one direction and strictly increasing in the other. Each row of the Q matrix is stored in a 4-entry stack. The 4 messages are produced simultaneously by the tree adder structure described previously, but consumed in a strictly left-to-right manner. The subsequent parity check computations only need to keep track of the row numbers to pop the correct values from the appropriate stack through a 4608-input multiplexer. A pipelined multiplexer is necessary in order to meet Gbps throughputs with such a large number of inputs.

VI. COMPARISON OF SISO DECODERS

The number of computational units required for each of the iterative decoder modules is summarized in Table I. A partial response channel decoder is concatenated with either a 16-state binary convolutional decoder or an LDPC decoder. As described in Section IV, the BCJR decoder can be replaced with a SOVA decoder. The number of ACSA units is then reduced from 3 to 1 per state-slice, a 66% savings in structural computation units, while the memory savings is 30%. The SOVA algorithm trades off complexity for a predictable degradation in BER performance relative to the BCJR algorithm.

The throughputs of both the MAP and SOVA decoders are limited by the feedback loop that exists in the Add–Compare–Select units. If area and power were not constrained, a 1 Gbps iterative decoder based on MAP or SOVA decoding would be achievable with current technology; however, due to the mandatory unrolling and pipelining, such a decoder would be between 10 and 15 times the area and power of any existing decoder implementation based on conventional Viterbi sequence detection.

On the other hand, the proposed LDPC decoder system is strictly feedforward; therefore, introducing additional levels of pipelining can alleviate delay issues at the expense of register area and negligible latency. It has been widely recognized that LDPC decoders enjoy a significant advantage in terms of computational complexity compared to the trellis-based decoding in MAP and SOVA decoders. This characteristic is reflected in our

TABLE I
COMPUTATIONAL UNITS AND MEMORY REQUIREMENTS FOR ITERATIVE DECODER MODULES

proposed implementation, which uses a small number of computational units: 16 adders and 8 LUT's.

However, the lack of any structural regularity in the parity check matrix results in memory requirements that are 2 orders of magnitude larger than those in the MAP or SOVA decoders. It was shown by example in Section V-C that a single LDPC iteration would have a memory requirement upwards of 73 000 words. To make an LDPC decoder implementation more feasible, it will be necessary to introduce regularity into the parity check matrix. Recent publications [13], [14], suggesting the construction of LDPC-like codes based on difference-set cyclic codes, may provide the necessary foundation for building a practical LDPC decoder with a reduced memory requirement.

The memory problem is not restricted to LDPC decoders. Interleavers, which are necessary between concatenated convolutional decoders, also require significant memory due to the randomness of the output sequences. Interleavers that allow some form of ordered permutation and compact representation will permit efficient implementations of Turbo decoders with no performance loss.

Finally, an iterative decoder implementation for magnetic storage applications requires timing recovery methods that can tolerate the increased latencies through multiple decoding iterations.

VII. CONCLUSION

We have proposed datapath-intensive architectures as well as timing and data arrangement schedules for each kind of SISO decoder in order to minimize the critical path delay and simplify the control logic.

Unrolling and pipelining of iterative decoders is necessary to sustain high throughputs, but leads to a linear increase in implementation complexity; however, it also provides an excellent opportunity for reduced-complexity implementations. Since decisions become increasingly confident after each stage, decoders that are later in the pipeline can trade off some BER performance for reduced complexity. A number of choices are available, ranging from replacing MAP decoders with SOVA decoders, or using shorter window lengths, to trellis pruning in the trellis-based decoders [15].

The immediate difficulty with LDPC decoders lies in the memory requirement, which should be addressed by designing structured LDPC codes. Without removing the memory bottleneck, further reduced-complexity LDPC decoding, such as approximating the summations in (7) and (9) with minimum and maximum functions, respectively, would have little effect on the overall decoder implementation.

REFERENCES

[1] T. Souvignier, M. Oberg, P. Siegel, R. Swanson, and J. Wolf, "Turbo decoding for partial response channels," IEEE Trans. Commun., vol. 48, no. 8, Aug. 2000.
[2] L. R. Bahl, J. Cocke, F. Jelinek, and J. Raviv, "Optimal decoding of linear codes for minimizing symbol error rate," IEEE Trans. Inform. Theory, vol. IT-20, pp. 284–287, Mar. 1974.
[3] J. Hagenauer and L. Papke, "Decoding turbo codes with the soft output Viterbi algorithm (SOVA)," in Proc. IEEE ISIT 1994, Trondheim, Norway, June 1994, p. 164.
[4] R. G. Gallager, "Low density parity check codes," IRE Trans. Inform. Theory, vol. IT-8, pp. 21–28, Jan. 1962.
[5] J. Fan and J. Cioffi, "Constrained coding techniques for soft iterative decoders," in Proc. GLOBECOM '99, vol. 16, Rio de Janeiro, Brazil, Dec. 1999, pp. 723–727.
[6] G. Masera, G. Piccinini, M. Roch, and M. Zamboni, "VLSI architectures for turbo codes," IEEE Trans. VLSI Systems, vol. 7, no. 3, Sept. 1999.
[7] Y. Wu and B. Woerner, "The influence of quantization and fixed point arithmetic upon the BER performance of turbo codes," in Proc. IEEE VTC 1999, Houston, TX, USA, May 1999, pp. 1683–1687.
[8] S. Dolinar and D. Divsalar, "Weight distributions for turbo codes using random and nonrandom permutations," JPL TDA Progress Rep., Aug. 1995.
[9] K. Andrews, C. Heegard, and D. Kozen, "Interleaver design methods for turbo codes," in Proc. IEEE ISIT 1998, Cambridge, MA, USA, Aug. 1998, p. 420.
[10] P. Robertson, E. Villebrun, and P. Hoeher, "A comparison of optimal and sub-optimal MAP decoding algorithms operating in the log domain," in Proc. IEEE ICC 1995, Seattle, WA, USA, June 1995, pp. 1009–1013.
[11] A. Viterbi, "An intuitive justification and a simplified implementation of the MAP decoder for convolutional codes," IEEE J. Select. Areas Commun., vol. 16, no. 2, pp. 260–264, Feb. 1998.
[12] C. Berrou, P. Adde, E. Angui, and S. Faudeil, "A low complexity soft-output Viterbi decoder architecture," in Proc. IEEE ICC 1993, Geneva, Switzerland, May 1993, pp. 737–740.
[13] D. J. C. MacKay and M. C. Davey, "Evaluation of Gallager codes for short block length and high rate applications," in Proc. IMA Workshop on Codes, Systems and Graphical Models 1999, Minneapolis, MN, USA, Aug. 1999.
[14] Y. Kou, S. Lin, and M. P. C. Fossorier, "Low density parity check codes based on finite geometries: A rediscovery," in Proc. IEEE ISIT 2000, Sorrento, Italy, June 2000.
[15] B. Frey and F. Kschischang, "Early detection and trellis splicing: Reduced-complexity iterative decoding," IEEE J. Select. Areas Commun., vol. 16, no. 2, pp. 153–159, Feb. 1999.
