
A Low-Power 1-Gbps Reconfigurable LDPC Decoder Design For Multiple 4G Wireless Standards

This document summarizes a research paper presented at the IEEE International System-on-Chip Conference in September 2008. The paper describes a low-power, reconfigurable LDPC decoder design capable of 1 Gbps speeds for multiple 4G wireless standards. The decoder uses a pipelined layered belief propagation algorithm to achieve partial-parallel decoding of structured LDPC codes. Two power saving schemes are employed to reduce power consumption by up to 65%. The decoder was synthesized and implemented in a 90nm CMOS technology with an area of 3.5 mm2 and maximum clock frequency of 450 MHz at 410 mW power consumption.


IEEE International System-on-Chip (SoC) Conference (SOCC'08), Sept. 2008

A LOW-POWER 1-Gbps RECONFIGURABLE LDPC DECODER DESIGN FOR MULTIPLE 4G WIRELESS STANDARDS

Yang Sun and Joseph R. Cavallaro


Department of Electrical and Computer Engineering
Rice University, Houston, TX 77005
Email: {ysun, cavallar}@rice.edu

Abstract: In this paper we present an efficient system-on-chip implementation of a 1-Gbps LDPC decoder for 4G (or beyond 3G) wireless standards. The decoder has a scalable datapath and can be dynamically reconfigured to support multiple 4G standards. We utilize a pipelined version of the layered belief propagation algorithm to achieve partial-parallel decoding of structured LDPC codes. Instead of using the sub-optimal Min-sum algorithm, we propose to use the powerful belief propagation (BP) decoding algorithm by designing an area-efficient soft-input soft-output (SISO) decoder. Two power saving schemes are employed to reduce the power consumption by up to 65%. The decoder has been synthesized, placed, and routed on a TSMC 90nm 1.0V 8-metal layer CMOS technology with a total area of 3.5 mm$^2$. The maximum clock frequency is 450 MHz and the estimated peak power consumption is 410 mW.

I. INTRODUCTION

The approaching fourth-generation (4G) wireless systems are projected to provide 100 Mbps to 1 Gbps speeds by 2010, which consequently leads to orders-of-magnitude complexity increases in the wireless receiver SoC (System-on-Chip). As a core technology in wireless communications, FEC (forward error correction) coding has migrated from 2G convolutional/block codes to more powerful 3G Turbo codes, and LDPC (low-density parity-check) codes [1] are forecast for 4G systems because of their excellent error correction performance and highly parallel decoding scheme. Designing a flexible, high-throughput LDPC decoder that also meets the power consumption constraints of wireless handsets is very challenging.

Most of the research on LDPC decoder design so far has focused on one particular system in which specific optimizations are made to improve the decoder performance. For example, the authors of [2][3] discussed LDPC decoders for the WiMax system, and the authors of [4][5] presented custom-designed, non-standard LDPC decoders. In this paper, we discuss a scalable and dynamically reconfigurable LDPC decoder targeting multiple 4G standards. For low power implementations, we introduce an early termination scheme and a distributed SISO decoding and memory banking scheme to reduce power consumption.

II. DECODING ALGORITHM

A binary LDPC code is a linear block code specified by a very sparse binary $M \times N$ parity check matrix: $H \cdot x^T = 0$, where $x$ is a codeword ($x \in C$). $H$ can be viewed as a bipartite graph where each column and row in $H$ represent a variable node and a check node, respectively.

A. Block structured LDPC codes

Non-zero elements in $H$ are typically placed at random positions to achieve good performance. However, this randomness is unfavorable for efficient VLSI implementation, which calls for structured design. To address this issue, block-structured LDPC codes have recently been proposed for several new communication standards such as IEEE 802.11n, IEEE 802.16e, DVB-S2 and DMB-T. As shown in Fig. 1, a block structured parity check matrix can be viewed as a 2-D array of square sub-matrices. Each sub-matrix is either a zero matrix or a cyclically shifted identity matrix $I_x$. Generally, a block structured parity check matrix $H$ consists of a $j \times k$ array of $z \times z$ cyclically shifted identity matrices with random shift values $x$ ($0 \le x < z$). Table 1 summarizes the design parameters for $H$ in several standards.

Fig. 1. A block structured parity check matrix with block rows (or layers) j = 4 and block columns k = 8, where the sub-matrix size is z × z.

Table 1: Design parameters for H in several standards
LDPC Code   WLAN-802.11n   WiMax-802.16e   DMB-T
j           4-12           4-12            24-48
k           24             24              60
z           27-81          24-96           127

In order to efficiently decode structured LDPC codes, we adopt the layered belief propagation (LBP) algorithm [6], which is described in Algorithm 1. In the description of the algorithm, $N_m$ is the set of variable nodes connected to check node $m$, and $N_m \setminus n$ is the set $N_m$ with variable node $n$ excluded. $\Lambda_{mn}$ and $\lambda_{mn}$ denote the check
and variable message, respectively. $L_n$ denotes the a posteriori probability (APP) log-likelihood ratio (LLR) of variable node $n$: $L_n = \log(P(x_n = 0)/P(x_n = 1))$. $H^l$ denotes the $l$-th layer in $H$.

Algorithm 1 Layered belief propagation algorithm
Initialization:
  $\forall (m, n)$ with $H(m, n) = 1$, set $\Lambda_{mn} = 0$, $L_n = 2 y_n / \sigma^2$
for iteration $i = 1$ to $I$ do
  for layer $l = 1$ to $L$ do
    1) Read:
       $\forall (m, n)$ with $H^l(m, n) = 1$:
       Read $L_n$ and $\Lambda_{mn}$ from memory
    2) Decode:
       $\lambda_{mn} = L_n - \Lambda_{mn}$
       $\Lambda^{new}_{mn} = \prod_{j \in N_m \setminus n} \mathrm{sign}(\lambda_{mj}) \cdot \Psi\left( \sum_{j \in N_m \setminus n} \Psi(\lambda_{mj}) \right)$
       $L^{new}_n = \lambda_{mn} + \Lambda^{new}_{mn}$
    3) Write back:
       Write $L^{new}_n$ and $\Lambda^{new}_{mn}$ back to memory
  end for
end for
Decision making: $\hat{x}_n = \mathrm{sign}(L_n)$

III. VLSI ARCHITECTURE

A. Block-serial scheduling algorithm

To implement Algorithm 1 in hardware, we propose a block-serial (BS) scheduling algorithm as shown in Fig. 2. In this algorithm, one full iteration is divided into $j$ sub-iterations. SISO decoding is applied to each layer in sequence. Each $z \times z$ sub-matrix is treated as a macro within which all the involved parity checks are processed in parallel using $z$ SISO decoders. Each SISO decoder is independent of all others since there is no data dependence between adjacent check rows.

Fig. 2. Block-serial (BS) scheduling algorithm

B. Low-complexity implementation of BP algorithm

Conventionally, the function $\Psi(x) = -\log(\tanh(|x/2|))$ is used for the decoding operations in Algorithm 1. However, the $\Psi(x)$ function is prone to quantization noise and can be numerically unstable [7]. Alternatively, a different and numerically more robust way to compute $\Lambda_{mn}$ is

$$\Lambda_{mn} = \sum_{j \in N_m \setminus n}\!\boxplus\, \lambda_{mj} = \left( \sum_{j \in N_m}\!\boxplus\, \lambda_{mj} \right) \boxminus \lambda_{mn}, \qquad (1)$$

where the $\boxplus$ and $\boxminus$ operations are defined as $a \boxplus b \triangleq f(a, b) = \log \frac{1 + e^a e^b}{e^a + e^b}$ and $a \boxminus b \triangleq g(a, b) = \log \frac{1 - e^a e^b}{e^a - e^b}$ [8][9]. This computation method is especially suitable for the proposed BS scheduling algorithm, in which the macro blocks are processed in sequential order. For hardware implementation, the $f(\cdot)$ and $g(\cdot)$ functions can be simplified to

$$f(a, b) = \mathrm{sign}(a)\,\mathrm{sign}(b) \left( \min(|a|, |b|) + \log(1 + e^{-(|a|+|b|)}) - \log(1 + e^{-\left||a|-|b|\right|}) \right),$$
$$g(a, b) = \mathrm{sign}(a)\,\mathrm{sign}(b) \left( \min(|a|, |b|) + \log(1 - e^{-(|a|+|b|)}) - \log(1 - e^{-\left||a|-|b|\right|}) \right). \qquad (2)$$

In hardware, the non-linear correction terms $\log(1 + e^{-x})$ and $\log(1 - e^{-x})$ in (2) are approximated using low-complexity 3-bit lookup tables (LUTs) [9].

C. Radix-2 SISO decoder

Fig. 3 shows the proposed SISO decoder architecture for generating $\Lambda_{mn}$. We refer to it as the Radix-2 (R2) recursion architecture since only one element can be processed in one clock cycle. The R2-SISO core consists of one $f(\cdot)$ recursion unit followed by one $g(\cdot)$ unit. Note that the $g(\cdot)$ unit has the same structure as the $f(\cdot)$ unit but with a different LUT.

Fig. 4 shows the decoding schedule for check row $m$. During the first $d_m$ cycles (where $d_m$ is the number of non-zero elements in check row $m$), the incoming variable messages $\lambda_{mn}$ ($\forall n \in N_m$) are fed to the decoder sequentially and the $f(\cdot)$ unit is reused $d_m$ times to obtain the intermediate $\boxplus$ sum $S_m$. Then, the outgoing messages $\Lambda_{mn}$ ($\forall n \in N_m$) are generated in sequential order by the $g(\cdot)$ unit. Though the decoding is sequential for each check row, multiple ($z$) check rows within one layer can be processed in parallel by employing multiple ($z$) SISO decoders, which increases the throughput by a factor of $z$ (see Fig. 2). Furthermore, the decoding throughput can be improved by overlapping the decoding of two layers as shown in Fig. 4. This scheduling requires dual-port memory for simultaneous read and write operations. Data dependencies between layers will occasionally stall the pipeline for one or more cycles. However, the pipeline stalls can be avoided by shuffling the order of the layers [10].

D. Radix-4 SISO decoder via look-ahead transform

To increase the throughput of the R2-SISO decoder, a look-ahead transform can be used for the $f(\cdot)$ recursion. This transform increases the number of data elements processed in each cycle, as shown in Fig. 5, where two elements are processed in one clock cycle. We refer to this transform as Radix-4 (R4) recursion. Fig. 6 shows the corresponding Radix-4 SISO decoder architecture.
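As an illustration of Eqs. (1) and (2) and of the look-ahead idea, the following Python sketch evaluates the two operations in floating point and compares a one-element-per-step recursion with a two-elements-per-step recursion. It is a minimal sketch under stated assumptions, not the hardware datapath: the helper names `r2_recursion`, `r4_recursion`, and `check_to_variable` are hypothetical, exact logarithms stand in for the 3-bit LUTs, and the pairing used in `r4_recursion` is only one way to exploit the associativity of the $f(\cdot)$ operation, not necessarily the exact wiring of Fig. 5.

```python
import math
from functools import reduce


def _sgn(v):
    # Sign as +1.0 or -1.0 (zero is treated as positive).
    return math.copysign(1.0, v)


def f(a, b):
    """Box-plus of two LLRs in the sign/min/correction form of Eq. (2)."""
    x, y = abs(a), abs(b)
    corr = math.log1p(math.exp(-(x + y))) - math.log1p(math.exp(-abs(x - y)))
    return _sgn(a) * _sgn(b) * (min(x, y) + corr)


def g(a, b):
    """Box-minus in the form of Eq. (2); assumes a, b nonzero and |a| != |b|."""
    x, y = abs(a), abs(b)
    corr = math.log1p(-math.exp(-(x + y))) - math.log1p(-math.exp(-abs(x - y)))
    return _sgn(a) * _sgn(b) * (min(x, y) + corr)


def r2_recursion(lams):
    """Accumulate the box-plus sum S_m one element per step (Radix-2 style)."""
    s, steps = math.inf, 0  # +inf is the identity element of box-plus
    for lam in lams:
        s = f(s, lam)
        steps += 1
    return s, steps


def r4_recursion(lams):
    """Accumulate S_m two elements per step (illustrative look-ahead)."""
    s, steps = math.inf, 0
    for i in range(0, len(lams), 2):
        pair = lams[i:i + 2]          # last group may hold a single element
        s = f(s, reduce(f, pair))     # combine the pair, then the running sum
        steps += 1
    return s, steps


def check_to_variable(lams):
    """Eq. (1): Lambda_mn = S_m box-minus lambda_mn for every n in N_m."""
    s, _ = r2_recursion(lams)
    return [g(s, lam) for lam in lams]


if __name__ == "__main__":
    lam_row = [1.5, -0.8, 2.3, -3.1, 0.9]   # assumed example messages
    print(r2_recursion(lam_row))
    print(r4_recursion(lam_row))
    print(check_to_variable(lam_row))
```

Both recursions return the same sum up to rounding, while the two-element version needs about half as many steps, which is the effect the R4 architecture exploits.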
Since two elements can be processed in each cycle, the R4-SISO decoder has a throughput speed-up of 2. Table 2 summarizes the synthesis results (90nm CMOS technology) for the R4 and R2 SISO decoders. To compare these two architectures, we define an efficiency factor $\eta$ as the throughput speed-up with R4-SISO divided by the area overhead. As can be seen, R4-SISO is more throughput-area efficient than R2-SISO ($\eta > 1$), especially at lower clock frequencies.

Table 2: Comparison of two SISO decoder architectures
                             450 MHz      325 MHz      200 MHz
R2 SISO area                 6978 µm2     6367 µm2     6197 µm2
R4 SISO area                 12774 µm2    10077 µm2    8944 µm2
η = Speedup / Area overhead  1.09         1.26         1.39

Fig. 3. Radix-2 (R2) SISO decoder architecture

Fig. 4. Pipelined decoding schedule

Fig. 5. One level look-ahead transform of f(·) recursion

Fig. 6. Radix-4 (R4) SISO architecture

E. Scalable and reconfigurable LDPC decoder

To support multiple LDPC codes, the datapath has to be scalable and reconfigurable. Fig. 7 shows the proposed LDPC decoder architecture. In the proposed BS scheduling algorithm, the parallelism factor is equal to the sub-matrix size $z$. Since the parameter $z$ varies from code to code (e.g., 19 different sizes of $z$ are defined in WiMax), we must design a datapath that is modular and scalable to support different code types. This is achieved by employing distributed SISO decoders and memory banks as shown in Fig. 7. This architecture can also reduce the overall power consumption by deactivating the memory banks and SISO decoders that are not being used. The $L$ messages, on the other hand, are stored in a central memory bank for parallel access by the $z$ SISO decoders. This is achieved by grouping $[1 \times z]$ $L$ messages (associated with each sub-matrix) into one memory word.

The decoding flow for one sub-iteration is as follows: at each cycle, $[1 \times z]$ $L$ messages are first fetched from the L-memory and passed through a circular shifter to be routed to the $z$ SISO decoders. The soft input information $\lambda_{mn}$ is formed by subtracting the old extrinsic message $\Lambda_{mn}$ from the APP message $L_n$. Then the SISO decoder generates a new extrinsic message $\Lambda_{mn}$ and a new APP message $L_n$, and stores them back to the Λ-memory and the L-memory, respectively.

Fig. 7. LDPC decoder architecture with scalable datapath

By designing proper control logic, the decoder can be dynamically reconfigured to support multiple block-structured LDPC codes. With this partial-parallel architecture, the pipelined (Radix-4) decoding throughput is approximately equal to $\frac{2 \times k \times z \times R \times f_{clk}}{E \times I}$, where $k$ is the number of block-columns in $H$, $z$ is the sub-matrix size, $R$ is the code rate, $f_{clk}$ is the clock frequency, $E$ is the total number of non-zero sub-matrices in $H$, and $I$ is the number of full iterations. Note that the latency of the circular shifter is not included in the throughput analysis; it may degrade the throughput by about 5-15%.
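The throughput expression above can be evaluated numerically with a short Python sketch. The function name `r4_throughput_bps` is hypothetical. In the example, $k = 24$ and $z = 96$ come from Table 1, and 450 MHz and 10 iterations come from the reported results. The rate $R = 1/2$ and $E = 76$ non-zero sub-matrices are assumed illustrative values, not figures taken from the paper.

```python
def r4_throughput_bps(k, z, rate, f_clk_hz, e_nonzero, iterations):
    """Approximate pipelined Radix-4 throughput: 2*k*z*R*f_clk / (E*I).

    The 2*k*z*R*f_clk numerator is the information bits per block (k*z*R)
    times the clock rate times the two elements handled per cycle; E*I is
    the number of sub-matrix visits needed to decode one block.
    """
    return 2.0 * k * z * rate * f_clk_hz / (e_nonzero * iterations)


# Assumed example: R = 1/2 and E = 76 are illustrative assumptions.
ideal = r4_throughput_bps(k=24, z=96, rate=0.5, f_clk_hz=450e6,
                          e_nonzero=76, iterations=10)

# Apply the 5-15% circular-shifter degradation mentioned in the text.
degraded = [ideal * (1.0 - loss) for loss in (0.05, 0.15)]

if __name__ == "__main__":
    print(f"ideal: {ideal / 1e9:.2f} Gb/s")
    print(f"with shifter latency: {degraded[1] / 1e9:.2f} to {degraded[0] / 1e9:.2f} Gb/s")
```

Under these assumptions the estimate stays above 1 Gb/s even after the shifter penalty. Lowering $z$ or raising $E$ or $I$ reduces it proportionally.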

IV. RESULTS

A multi-mode LDPC decoder which supports both IEEE 802.11n and IEEE 802.16e has been synthesized on a TSMC 90nm 1.0V 8-metal layer CMOS technology. Fig. 8 shows the VLSI layout view of the LDPC decoder. Table 3 compares this decoder with the state-of-the-art LDPC decoders of [3] and [4]. The decoder in [3] has the flexibility to support 19 modes of LDPC codes in the WiMax standard; however, it will not support the higher data rates envisioned for 4G and IMT-Advanced. The decoder in [4] has a throughput of 640 Mbps, but it does not have the flexibility to support multiple codes. As can be seen, our decoder shows significant performance improvement in throughput, flexibility, area and power.

Fig. 8. VLSI layout view of the LDPC decoder (labeled blocks: ROM, CTRL, Misc Logic, L-Mem, Circular Shifter, In/Out Buffer, R4-SISO Decoder + Distributed Λ-Mem x96)

Table 3: LDPC decoder architecture comparison
                 This Work      [3]         [4]
Flexibility      802.16e/.11n   802.16e     2048-bit fixed
Max Throughput   1 Gbps         111 Mbps    640 Mbps
Total Area       3.5 mm2        8.29 mm2    14.3 mm2
Max Frequency    450 MHz        83 MHz      125 MHz
Peak Power       410 mW         52 mW       787 mW
Technology       90 nm          0.13 µm     0.18 µm
Max Iteration    10             8           10
Algorithm        Full BP        Min-Sum     Linear Apprx.

As low power design is critical for wireless receivers, we have implemented a simple and effective early termination criterion for stopping the iteration process. The decoding stops if the following two conditions are satisfied: 1) the hard decisions for the information bits based on their LLR values do not change over two successive iterations, and 2) the minimum of the absolute values of the information bit LLRs is larger than a pre-defined threshold. As shown in Fig. 9 (a), when the wireless channel is good, the decoding needs fewer iterations to converge, which saves substantial power (up to 65% power reduction). Another power saving technique is to use distributed SISO decoders and memory banks. Fig. 9 (b) shows the power reduction from deactivating the unused SISO decoders and memory banks when the LDPC code size is small.

Fig. 9. Two power reduction techniques: (a) early termination (block size = 2304, max iter. = 10), power consumption (mW) versus Eb/N0 (dB), with and without early termination; (b) distributed SISO decoding and memory banking, power consumption (mW) versus block size (bit).

V. CONCLUSION

A high performance LDPC decoder has been described that achieves a throughput of 1 Gbps. The decoder has a scalable datapath and can be dynamically reconfigured to support multiple 4G wireless standards.

VI. ACKNOWLEDGEMENT

This work was supported in part by Nokia and by NSF under grants CCF-0541363, CNS-0551692, and CNS-0619767.

REFERENCES

[1] R. Gallager, “Low-density parity-check codes,” IEEE Trans. Inf. Theory, vol. 8, pp. 21–28, Jan. 1962.
[2] T. Brack, M. Alles, F. Kienle, and N. Wehn, “A Synthesizable IP Core for WIMAX 802.16e LDPC Code Decoding,” in IEEE 17th Int. Symp. Personal, Indoor and Mobile Radio Communications (PIMRC), 2006, pp. 1–5.
[3] X.-Y. Shih, C.-Z. Zhan, C.-H. Lin, and A.-Y. Wu, “A 19-mode 8.29mm2 52-mW LDPC Decoder Chip for IEEE 802.16e System,” in 2007 Symposium on VLSI Circuits, June 2007.
[4] M.M. Mansour and N.R. Shanbhag, “A 640-Mb/s 2048-Bit Programmable LDPC Decoder Chip,” IEEE Journal of Solid-State Circuits, vol. 41, pp. 684–698, March 2006.
[5] A.J. Blanksby and C.J. Howland, “A 690-mW 1-Gb/s 1024-b, rate-1/2 low-density parity-check code decoder,” IEEE Journal of Solid-State Circuits, vol. 37, no. 3, pp. 404–412, 2002.
[6] D. Hocevar, “A reduced complexity decoder architecture via layered decoding of LDPC codes,” in IEEE Work. on Signal Processing Syst. (SIPS), Oct. 2004, pp. 107–112.
[7] T. Zhang, Z. Wang, and K. Parhi, “On finite precision implementation of low density parity check codes decoder,” in Int. Symposium on Circuits and Systems (ISCAS), vol. 4, May 2001, pp. 202–205.
[8] J. Hagenauer, E. Offer, and L. Papke, “Iterative decoding of binary block and convolutional codes,” IEEE Trans. Inf. Theory, vol. 42, no. 2, pp. 429–445, 1996.
[9] X.-Y. Hu, E. Eleftheriou, D.-M. Arnold, and A. Dholakia, “Efficient implementations of the sum-product algorithm for decoding LDPC codes,” in IEEE GLOBECOM, Oct. 2001, pp. 1036–1036.
[10] K. Gunnam, G. S. Choi, M. B. Yeary, and M. Atiquzzaman, “VLSI Architectures for Layered Decoding for Irregular LDPC Codes of WiMax,” in Int. Conf. Commun. (ICC), June 2007.
