A Bit-Serial Approximate Min-Sum LDPC Decoder and FPGA Implementation
Ahmad Darabiha, Anthony Chan Carusone and Frank R. Kschischang
Department of Electrical and Computer Engineering, University of Toronto
Email: {ahmadd,tcc}@eecg.utoronto.ca, [email protected]
Abstract: We propose a bit-serial LDPC decoding scheme to reduce interconnect complexity in fully-parallel low-density parity-check decoders. Bit-serial decoding also facilitates efficient implementation of wordlength-programmable LDPC decoding, which is essential for gear-shift decoding. To simplify the implementation of bit-serial decoding, we propose a new approximation to the check update function in the min-sum decoding algorithm. The new check update rule computes only the absolute minimum and applies a correction to outgoing messages if required. We present a 650-Mbps bit-serial (480, 355) RS-based LDPC decoder implemented on a single Altera Stratix EP1S80 FPGA device. To our knowledge, this is the fastest FPGA-based LDPC decoder reported in the literature.

I. INTRODUCTION

Low-density parity-check (LDPC) codes [1] have recently been adopted for several data communication applications due to their superior coding performance and parallelizable decoder architecture. LDPC codes allow fine-grained parallel message-passing decoding in which all the check and variable nodes are updated concurrently. This parallelism can potentially be used to build a decoder with multi-Gbit/sec throughput. The major obstacle to efficient implementation of fully-parallel LDPC decoders is interconnect complexity, which results from the random location of 1's in the code's parity-check matrix.

In this paper, we propose a bit-serial scheme for fully-parallel LDPC decoders. Bit-serial computation allows variable and check nodes to communicate multi-bit messages over single wires, hence reducing the interconnect complexity. In addition, we introduce a new approximation to the check update function in min-sum decoding. In this approximation, each check node calculates only one minimum magnitude over all of its inputs. Depending on the number of inputs that share the same minimum magnitude, a corrective constant is then added in order to generate the proper check outputs. We show that with 4-bit quantization this approximation reduces the check node area by 48% while introducing less than 0.1 dB loss in BER performance.

We illustrate the feasibility of bit-serial LDPC decoding by implementing a (480, 355) RS-based LDPC decoder on a single Altera Stratix EP1S80 FPGA device based on the newly proposed check node architecture. The decoder operates at a maximum clock frequency of 61 MHz, performs 15 decoding iterations per frame and achieves 650 Mbps throughput.

This paper is organized as follows. The rest of this section briefly reviews LDPC codes and hardware implementation of iterative message-passing decoders. Section II introduces new approximations for the check update functions in a min-sum decoder. Section III describes the internal architecture of bit-serial variable and check nodes for a fully-parallel LDPC decoder based on the approximate min-sum algorithm of Section II. Finally, in Section IV, an FPGA implementation of a bit-serial (480, 355) fully-parallel LDPC decoder is presented.

A. LDPC codes and min-sum decoding

A binary $(N, N-M)$ LDPC code, $\mathcal{C}$, is the null space of a sparse $M \times N$ parity-check matrix, $H$. It can also be described by a bipartite graph, or Tanner graph. The Tanner graph of an LDPC code consists of two sets of nodes. Check nodes $\{c_1, c_2, \ldots, c_M\}$ represent the rows of $H$ and variable nodes $\{v_1, v_2, \ldots, v_N\}$ represent the columns. An edge connects the check node $c_m$ to the variable node $v_n$ if and only if $H_{mn}$ is nonzero. For a code with a full-rank $M \times N$ parity-check matrix, $H$, the code rate is $R = 1 - M/N$. We denote the set of variables that participate in check $c_m$ as $\mathcal{N}(m) = \{n : H_{mn} = 1\}$ and the set of checks in which the variable $v_n$ participates as $\mathcal{M}(n) = \{m : H_{mn} = 1\}$.

The following paragraphs describe min-sum (MS) decoding [2], which can be considered an approximation to the commonly used iterative sum-product (SP) algorithm [3]. Although the performance of MS is generally a few tenths of a dB lower than that of SP decoding, it is more robust to quantization errors when implemented with fixed-point operations [4]. Moreover, it requires much simpler hardware for the check node functions compared to SP decoding.

In MS decoding, as in the SP algorithm, the extrinsic messages are passed between check and variable nodes in the form of log-likelihood ratios (LLRs). Let $z_{mn}^{(i)}$ represent the LLR value for bit $n$, sent from variable node $v_n$ to check node $c_m$ in the $i$th iteration, and similarly let $\epsilon_{mn}^{(i)}$ represent the LLR value for bit $n$, sent from check node $c_m$ to variable node $v_n$ in the $i$th iteration. Suppose $W = (w_1, w_2, \ldots, w_N) \in \mathcal{C}$ and $Y = (y_1, y_2, \ldots, y_N)$ are the transmitted codeword and the received sequence respectively. The MS decoding algorithm consists of the following steps:

1) Initialize the iteration counter, $i$, to 1 and let $I_M$ be the maximum number of iterations allowed.

2) Initialize $z_{mn}^{(0)}$ to the a posteriori LLR, $\lambda_n = \log\big(P(v_n = 0 \mid y_n)/P(v_n = 1 \mid y_n)\big)$, for $1 \le n \le N$, $m \in \mathcal{M}(n)$.

3) Update the check nodes, i.e., for $1 \le m \le M$, $n \in \mathcal{N}(m)$, calculate

$$\epsilon_{mn}^{(i)} = \min_{n' \in \mathcal{N}(m) \setminus n} \left| z_{mn'}^{(i)} \right| \prod_{n' \in \mathcal{N}(m) \setminus n} \mathrm{sgn}\left( z_{mn'}^{(i)} \right). \tag{1}$$

4) Update the variable nodes, i.e., for $1 \le n \le N$, $m \in \mathcal{M}(n)$, calculate

$$z_{mn}^{(i)} = \lambda_n + \sum_{m' \in \mathcal{M}(n) \setminus m} \epsilon_{m'n}^{(i)}. \tag{2}$$

5) Apply a hard decision, i.e., compute $\hat{W} = (\hat{w}_1, \hat{w}_2, \ldots, \hat{w}_N)$ where element $\hat{w}_n$ is calculated as

$$\hat{w}_n = \begin{cases} 0 & \text{if } \lambda_n + \sum_{m \in \mathcal{M}(n)} \epsilon_{mn}^{(i)} \ge 0, \\ 1 & \text{otherwise.} \end{cases}$$

If $\hat{W} H^T = 0$ or $i \ge I_M$, stop decoding and go to step 6. Otherwise set $i = i + 1$ and go to step 3.

6) Output $\hat{W}$ as the decoder output.
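The six steps above can be sketched in a few lines of Python. This is a minimal floating-point illustration, not the hardware decoder described in this paper. The function name `min_sum_decode`, the dictionary-based message storage, the convention sgn(0) = +1, and the small (7, 4) Hamming parity-check matrix `H_EXAMPLE` are all assumptions chosen for the example. The sketch also assumes that the check update in iteration i reads the variable-to-check messages produced in the previous iteration (initially the channel LLRs), and that every check node has degree at least 2.

```python
def min_sum_decode(H, llr, max_iter=15):
    """Min-sum decoding of channel LLRs `llr` over parity-check matrix H (rows of 0/1)."""
    M, N = len(H), len(H[0])
    N_m = [[n for n in range(N) if H[m][n]] for m in range(M)]  # N(m)
    M_n = [[m for m in range(M) if H[m][n]] for n in range(N)]  # M(n)
    # Step 2: every variable-to-check message starts at the channel LLR.
    z = {(m, n): llr[n] for m in range(M) for n in N_m[m]}
    eps = {}
    w_hat = [0 if v >= 0 else 1 for v in llr]
    it = 0
    while it < max_iter:
        it += 1
        # Step 3: check update, eq. (1): extrinsic minimum magnitude and sign product.
        for m in range(M):
            for n in N_m[m]:
                others = [z[(m, k)] for k in N_m[m] if k != n]
                sign = -1 if sum(1 for v in others if v < 0) % 2 else 1
                eps[(m, n)] = sign * min(abs(v) for v in others)
        # Steps 4 and 5: variable update, eq. (2), and hard decision.
        for n in range(N):
            total = llr[n] + sum(eps[(m, n)] for m in M_n[n])
            for m in M_n[n]:
                z[(m, n)] = total - eps[(m, n)]  # exclude the message from check m
            w_hat[n] = 0 if total >= 0 else 1
        # Stop as soon as all parity checks are satisfied.
        if all(sum(w_hat[n] for n in N_m[m]) % 2 == 0 for m in range(M)):
            break
    return w_hat, it


# Hypothetical example: (7, 4) Hamming code, all-zero codeword, bit 3 received unreliably wrong.
H_EXAMPLE = [
    [1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 1],
]
decoded, iterations = min_sum_decode(H_EXAMPLE, [3, 3, 3, -1, 3, 3, 3])
```

In this example the three checks that contain bit 3 each return a message of +3 to it, which outweighs its channel LLR of -1, so the all-zero codeword is recovered in the first iteration.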
B. Decoder implementation

In message-passing LDPC decoding, a large number of messages need to be updated and transferred between check and variable nodes in each iteration. Previous works have proposed several approaches for representing and updating these messages. In [5], analog signals are used to represent the extrinsic messages. In analog decoders the exponential voltage-current relationship of a transistor is used to realize the message-passing update functions. Although analog decoders have the advantage of low power consumption, they become impractical for decoding long LDPC codes due to noise and process mismatch.

More conventional LDPC decoders often use multi-bit digital signals to represent the messages. In partially-parallel decoders [6], [7], the messages are transferred between the nodes through memory. This architecture reduces the decoder area by sharing the processing units, but this comes at the cost of reduced throughput. To achieve higher throughput, in the fully-parallel decoder presented in [8], all check and variable nodes are directly instantiated in hardware. Using this architecture, a throughput of 1 Gbps with 64 iterations per frame is reported. The major challenge in the implementation of fully-parallel LDPC decoders is the complex and random interconnection between the variable and check nodes. This problem is worsened when multi-bit buses are used to realize the edges in the code's Tanner graph.

C. Bit-serial computation

To reduce the complexity of the interconnect in fully-parallel LDPC decoders, in this paper we investigate a bit-serial approach for both communicating and computing extrinsic messages. Fig. 1 shows the difference between the conventional bit-parallel scheme and a bit-serial scheme for the simple case of transferring an $n$-bit number, $b_n \cdots b_2 b_1$. In Fig. 1(a) all $n$ bits are sent over $n$ parallel lines in one clock cycle. In contrast, in a bit-serial scheme as in Fig. 1(b), the message is sent over a single line in $n$ clock cycles.

Fig. 1. Two alternatives for synchronous transmission of an n-bit number: (a) bit-parallel: n bits sent in one clock cycle over n wires; (b) bit-serial: n bits sent in n clock cycles over one wire.

Stochastic computation [9] is similar to bit-serial computation in that it communicates extrinsic messages over single wires. It has a very simple check and variable node architecture but needs a significant amount of hardware overhead in order to translate the stochastic messages at the decoder inputs and outputs. In addition, stochastic computation uses a redundant number representation, which limits the decoder throughput.

In addition to simplifying the node-to-node interconnection, the bit-serial approach has several other advantages for fully-parallel LDPC decoders. In a bit-serial scheme, the wordlength of computations can be increased simply by increasing the number of clock cycles allocated for transmitting the messages. Using this property, the precision of the decoder can be made programmable just by re-timing the node-to-node message transfers without the need for extra routing channels. Programmability of the decoder wordlength allows one to efficiently trade off complexity for error correction performance. This in turn allows efficient implementation of gear-shift decoding [10]. Gear-shift decoding is based on the idea of changing the decoding update rule used in different iterations to simultaneously optimize hardware complexity and error correction performance. For instance, gear-shift decoding often suggests applying a complex, powerful update rule in the first few iterations followed by simpler update functions in later iterations. Bit-serial computation allows efficient shifting between update rules by changing the computation wordlength.

Bit-serial decoding, however, imposes some challenges. The immediate effect is that it reduces the decoder throughput compared with bit-parallel implementations, as multiple clock cycles are required for transmitting a single message. Also, some common check and variable update functions cannot be efficiently implemented bit-serially. Although bit-serial fully-parallel LDPC decoders have a lower throughput than bit-parallel fully-parallel LDPC decoders, we will show in this paper that their throughput can still be higher than that of hardware-sharing decoder schemes.

II. SIMPLIFIED CHECK UPDATE FUNCTION

The MS decoding algorithm, as described in Section I, is cumbersome if implemented in a bit-serial hardware decoder. In this section, we introduce an approximation to the MS algorithm that reduces the hardware complexity of check nodes while causing minimal degradation in code performance. In fact, this approximation is also applicable to bit-parallel hardware decoders.

The first step is to replace the check update rule of (1) with

$$\epsilon_{mn}^{(i)} = \min_{n' \in \mathcal{N}(m)} \left| z_{mn'}^{(i)} \right| \prod_{n' \in \mathcal{N}(m) \setminus n} \mathrm{sgn}\left( z_{mn'}^{(i)} \right). \tag{3}$$

In other words, the signs of the check node outputs are calculated exactly as before, but now the output magnitude is the minimum of the magnitudes of all input messages. Fig. 2 compares the BER performance of the original MS decoding algorithm with that of the modified MS based on (3) for two RS-based LDPC codes [11] using full-precision computations. This graph shows that with full-precision computations, the two algorithms perform almost identically. It is clear that a check update rule as in (3) significantly reduces the hardware complexity. The reason is that once the minimum among all input magnitudes is found, it is sent out as the magnitude of all the outgoing messages, $\epsilon_{mn}^{(i)}$, for all $n \in \mathcal{N}(m)$.

We have observed that although the above modification to MS results in almost no performance loss under full-precision operations, it introduces a considerable loss when performed in finite precision. Fig. 3 shows the effect of the MS approximation when applied to quantized messages. In the following paragraphs we introduce a further change to the modified MS decoding algorithm that reduces the performance gap shown in Fig. 3.

The sign of the output messages in the new check update rule is the same as in (3). The magnitude of the output message is calculated as follows. First, for check node $c_m$, in the $i$th iteration, we define $M_m^{(i)} = \min_{n' \in \mathcal{N}(m)} |z_{mn'}^{(i)}|$. We also define $1 \le T_m^{(i)} \le d_c$, where $d_c$ is the check node degree, as the number of inputs $z_{mj}^{(i)}$ to check node $c_m$ that satisfy $|z_{mj}^{(i)}| = M_m^{(i)}$. The magnitudes of the check node outputs are calculated as

$$\left| \epsilon_{mn}^{(i)} \right| = \begin{cases} M_m^{(i)} + 1 & \text{if } T_m^{(i)} = 1 \text{ and } |z_{mn}^{(i)}| = M_m^{(i)}, \\ M_m^{(i)} & \text{otherwise.} \end{cases} \tag{4}$$
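The difference between the exact rule (1) and the corrected approximation (4) can be illustrated with a short Python sketch for a single check node. The function names `check_update_exact` and `check_update_approx` are hypothetical, inputs are assumed to be integer quantized messages (so that "+1" means one quantization step), sgn(0) is taken as +1, and saturation logic is omitted.

```python
def check_update_exact(z_in):
    """Original min-sum check update, eq. (1): extrinsic minimum for each output."""
    out = []
    for j in range(len(z_in)):
        others = z_in[:j] + z_in[j + 1:]
        sign = -1 if sum(1 for v in others if v < 0) % 2 else 1
        out.append(sign * min(abs(v) for v in others))
    return out


def check_update_approx(z_in):
    """Approximate check update: one global minimum, eq. (3), plus the correction of eq. (4)."""
    mags = [abs(v) for v in z_in]
    m_min = min(mags)                     # M_m: the single minimum magnitude
    t = mags.count(m_min)                 # T_m: how many inputs attain the minimum
    neg = sum(1 for v in z_in if v < 0)   # number of negative inputs
    out = []
    for v, mag in zip(z_in, mags):
        # Sign of the product over all *other* inputs: remove this input's own sign.
        sign = -1 if (neg - (1 if v < 0 else 0)) % 2 else 1
        # Only a unique minimum input receives the +1 correction.
        corr = 1 if (t == 1 and mag == m_min) else 0
        out.append(sign * (m_min + corr))
    return out


close_case = (check_update_exact([3, -1, 4, 2]), check_update_approx([3, -1, 4, 2]))
far_case = (check_update_exact([5, -1, 4, 6]), check_update_approx([5, -1, 4, 6]))
```

For the inputs [3, -1, 4, 2] both rules give [-1, 2, -1, -1], because the second-smallest magnitude happens to be exactly one step above the minimum. For [5, -1, 4, 6] the exact rule returns 4 to the minimum input while the approximation returns 2; all other outputs are identical. Note that when the minimum is unique, every other input magnitude is at least one step larger, so the corrected value never exceeds the largest input magnitude.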
Fig. 2. Comparison between original min-sum and modified min-sum under full-precision operations for (2048, 1723) and (992, 833) LDPC codes. (Plot: BER versus Eb/N0; curves: original and modified min-sum for each of the two codes.)

Simulation results plotted in Fig. 3 show that with the new check update rule using 4-bit quantization, the BER performance gap to the original 4-bit MS algorithm is reduced from 0.7 dB to less than 0.1 dB at a BER of $10^{-6}$. More importantly, the error floor effect is also avoided. In spite of the extra hardware needed for the correction term, VLSI implementation of a degree-15 check node using a CMOS 90nm cell library shows that a check node based on (4) is 48% smaller than a check node based on (1). This is because there is no need to calculate the second minimum among the check node inputs.

Fig. 3. Comparison between original min-sum and modified min-sums as in (3) and (4) under fixed-point operations for the (2048, 1723) LDPC code. (Plot: BER versus Eb/N0; curves: original min-sum, full precision; original min-sum, 4bit; modified min-sum as in (3), 4bit; modified min-sum with correction as in (4), 4bit.)

III. NODE ARCHITECTURE

This section describes an internal architecture for bit-serial hardware implementation of variable and check nodes based on (2) and (4) respectively. As discussed in Section II, in the modified MS algorithm only the smallest magnitude among all check inputs needs to be found. Fig. 4 shows the pipelined bit-serial module that finds the minimum of the check inputs. This module receives $d_c$ inputs. Each input is an $n$-bit sign-magnitude binary number which is received bit-serially (MSB first). The output is a bit-serial $n$-bit number which corresponds to the smallest magnitude among the inputs. Associated with each input there is a flip-flop acting as a status flag which indicates whether that input is still a candidate for being the minimum. At the beginning, the status flags are all reset to zero. As the bits are received, starting from the MSB, some flags become '1', indicating that the corresponding input is out of competition. Notice that the circuit in Fig. 4 only processes the magnitude of the check node inputs, whereas the sign bit is generated separately using an XOR tree.

Fig. 4. A bit-serial module for detecting the minimum magnitude of the check node inputs.

Fig. 5. A degree-6 variable node architecture for computing (2) with a forward-backward architecture [12]. Each adder box consists of a full-adder and a flip-flop to store the carry from the previous cycle.

To find an efficient bit-serial variable node architecture, we have investigated two alternatives. The first architecture, shown in Fig. 5, is based on a forward-backward computation [12]. The main difference between our approach and [12] is that here all the inputs and outputs are bit-serial. The main problem with the forward-backward architecture of Fig. 5 is that for a variable node of degree $d_v$ the critical path consists of a chain of $(d_v - 2)$ two-input adders. For LDPC codes with relatively high $d_v$, this can limit the timing performance of the decoder. The second variable node architecture investigated in this paper is shown in Fig. 6. In this architecture, the bit-serial inputs are first converted to parallel inputs and then the additions are performed in one cycle using parallel adders/subtracters. The parallel outputs are finally converted back to bit-serial format before being sent to the check nodes.

Table I summarizes the VLSI hardware cost and timing performance of two degree-6 variable nodes corresponding to the two alternatives above. The parameters in this table are based on synthesis results using a CMOS 90nm cell library and 3-bit quantization. Based on Table I, we have used the variable node architecture of Fig. 6 in the design presented in this paper since it is superior in terms of both timing and area.

Both check and variable nodes in this design are pipelined. For $n$-bit quantized input messages they generate $n$-bit output messages in $n$ clock cycles. Each iteration of LDPC message-passing decoding consists of one check and one variable node update. As a result, using a conventional scheme, $2n$ clock cycles are needed to complete one iteration. However, in this design we adopt a block-interlaced scheme
[13] where two frames are processed in the decoder simultaneously in an interlaced fashion: while the check nodes process one frame, the variable nodes process the other frame. So, in effect, it takes only $n$ cycles to complete one iteration, hence doubling the throughput.

Fig. 6. A variable node architecture for computing (2) with parallel adders and parallel-serial converters at the inputs and outputs.

TABLE I
COMPARISON BETWEEN VARIABLE NODE ARCHITECTURES OF FIG. 5 (FORWARD-BACKWARD) AND FIG. 6 (PARALLEL ADDER/SUBTRACTERS) WITH dv = 6 AND 3-BIT QUANTIZATION, SYNTHESIZED WITH CMOS 90nm LIBRARY CELLS.

Architecture                      Forward-backward    Parallel adder
Combinational area (µm^2)         2484                2099
Non-combinational area (µm^2)     623                 405
Total area (µm^2)                 3107                2504
Minimum clock period (nsec)       3                   2.20

IV. FPGA IMPLEMENTATION

To demonstrate the feasibility of bit-serial message-passing decoding, we have developed a fully-parallel (480, 355) RS-based LDPC decoder on a single Altera Stratix EP1S80 FPGA device using a configurable prototyping board called Transmogrifier-4 [14]. This decoder updates the extrinsic messages using the node architectures of Fig. 4 and Fig. 6. Since the updated messages are carried bit-serially over single wires, the complexity of node-to-node interconnections is less than that of conventional bit-parallel fully-parallel decoders [8]. Fig. 7 shows the measured BER performance from the decoder hardware as well as the bit-true simulation. Table II summarizes the FPGA implementation results. The decoder operates at a clock frequency of 61 MHz and performs 15 iterations per frame. Using the block-interlacing technique and a wordlength of 3 bits, each iteration takes 3 clock cycles to complete, which results in 650 Mbps total throughput (480 bits × 61 MHz / (15 iterations × 3 cycles) ≈ 650 Mbps).

Fig. 7. FPGA hardware BER results and bit-true software simulation. (Plot: BER versus Eb/N0; curves: (480,355) LDPC, 3bit, Hardware; (480,355) LDPC, 3bit, Bit-true simulation.)

TABLE II
(480, 355) RS-BASED LDPC DECODER IMPLEMENTATION RESULTS ON ALTERA STRATIX EP1S80 DEVICE.

Logic elements (LEs)              66,588 (84%)
Max clock frequency (MHz)         61
Code length                       480
Iterations per frame              15
Wordlength (bits)                 3
Decoder throughput (Mbps)         650

V. CONCLUSION

In this paper we presented a bit-serial architecture for fully-parallel LDPC decoding. We also proposed a new approximation to the check update function in MS decoding. Based on these ideas, we presented a 650-Mbps FPGA-based fully-parallel LDPC decoder which, to our knowledge, is the fastest FPGA LDPC decoder reported in the literature.

REFERENCES

[1] R. G. Gallager, Low-Density Parity-Check Codes. Cambridge, MA: MIT Press, 1963.
[2] N. Wiberg, Codes and decoding on general graphs, PhD thesis. Linkoping: Linkoping University, 1996.
[3] F. R. Kschischang, B. J. Frey, and H. A. Loeliger, “Factor graphs and the sum-product algorithm,” IEEE Trans. on Information Theory, vol. 47, pp. 498–519, Feb. 2001.
[4] A. B. F. Zarkeshvari, “On implementation of min-sum algorithm for decoding low-density parity-check (LDPC) codes,” in IEEE Globecom conference, 2002.
[5] F. Lustenberger, On the design of analog VLSI iterative codes, PhD thesis. Zurich: Swiss Federal Institute of Technology, 2000.
[6] E. Yeo, P. Pakzad, B. Nikolic, and V. Anantharam, “VLSI architectures for iterative decoders in magnetic recording channels,” IEEE Transactions on Magnetics, vol. 37, pp. 748–755, March 2001.
[7] T. Zhang and K. K. Parhi, “A 54 MBPS (3, 6)-regular FPGA LDPC decoder,” in IEEE Workshop on Signal Processing Systems, San Diego, CA, 2002.
[8] A. J. Blanksby and C. J. Howland, “A 690-mW 1-Gb/s 1024-b, rate-1/2 low-density parity-check decoder,” IEEE Journal of Solid-State Circuits, vol. 37, no. 3, Mar. 2002.
[9] V. Gaudet and A. Rapley, “Iterative decoding using stochastic computation,” Electronics Letters, vol. 39, no. 3, pp. 299–301, February 2003.
[10] M. Ardakani and F. R. Kschischang, “Gear-shift decoding,” in Proc. 21st Biennial Symp. on Comm., Queen’s University, Canada, 2002.
[11] I. Djurdjevic, J. Xu, K. Abdel-Ghaffar, and S. Lin, “A class of low-density parity-check codes constructed based on Reed-Solomon codes with two information symbols,” IEEE Comm. Letters, vol. 7, no. 7, July 2003.
[12] X.-Y. Hu, E. Eleftheriou, D.-M. Arnold, and A. Dholakia, “Efficient implementation of the sum-product algorithm for decoding LDPC codes,” in IEEE Global Telecommunications Conference, vol. 2, San Antonio, TX, 2001, pp. 1036–1036E.
[13] A. Darabiha, A. Chan Carusone, and F. R. Kschischang, “Block-interlaced fully-parallel LDPC decoders with reduced interconnect complexity,” submitted to IEEE Transactions on VLSI Systems.
[14] Transmogrifier-4, World Wide Web, http://www.eecg.utoronto.ca/~tm4.