An Automated FPGA-based Framework For Rapid Prototyping of Nonbinary LDPC Codes
demonstrates great potential for practical use, but they also degrade the error correction performance.

[Fig. 2. Reconfigurable emulation system architecture. Recoverable block labels: CN RAM; BW and FW memories; c-v memory; Perm and Perm-1; ECN; VN; posterior memory; v-c memory; LUT; FE, SE and BE stages. Table I fragments interleaved with the figure (decoder design parameters): Q, number of quantization bits; LS-VN, VN sorter length.]
We briefly summarize the EMS and MM algorithms here for completeness. Both algorithms follow a five-step decoding process: (1) each variable node is initialized with sorted prior log-likelihood ratio vectors (LLRV) L along with their associated GF indices L. The length of the LLRVs is determined by the message truncation number nm; (2) variable-to-check (v-c) messages are permuted based on the H matrix and sent to the check nodes. In the first iteration, the priors are used as the v-c messages; (3) for each adjacent variable node vj, check node ci computes the check-to-variable (c-v) message {Vij[k]}, k ∈ {0, …, nm - 1}, which expresses that the parity-check equation is satisfied if vj = Vij[k]. The computation is implemented as a forward-backward recursion: EMS computes the sums through this recursion, while MM picks only the maxima without summation operations for even lower complexity. Note that c-v messages are sorted and only the nm highest probabilities are stored in both algorithms. The bubble-check technique [16] improves the check node latency as well as the hardware utilization by reducing the sorter length from nm to LS-CN while maintaining equivalent functionality; (4) c-v messages are inverse permuted before being sent to the variable nodes; (5) variable node vj computes the v-c message {Uji[k]}, k ∈ {0, …, nm - 1}, for each adjacent check node ci based on the prior LLRVs and the permuted c-v messages. The skimming technique [6] skims less reliable probabilities and reduces the VN sorter length from nm to LS-VN. The procedure repeats from step (2) until the iteration limit L is reached.

Decoder architectures implementing EMS or MM have been developed for FPGA emulation of various NB-LDPC codes [6], [8], [9], [10], but most of them offer flexibility only for parameters such as the iteration limit L. Important parameters that strongly affect error-correction performance and throughput, such as q and nm, can be studied only in software simulations, which take weeks to months to reach the low-BER region. Reconfigurability for these parameters in hardware involves complicated architecture and schedule changes, and the extensive effort and time must be repeated for every possible parameter combination. These challenges call for an automated design flow with a new decoder and emulation architecture that enables full reconfigurability and delivers a high throughput.

III. RECONFIGURABLE EMULATION

Reconfigurable emulation for NB-LDPC requires addressing parameters of three categories: code parameters, decoder design parameters and run-time parameters. We summarize the parameters with their descriptions in Table I.

A. Emulation System Design

Assuming without loss of generality that the all-zero codeword is transmitted, we introduce a fully reconfigurable emulation system with a high-throughput decoder. Fig. 2 shows the architecture of the proposed emulation system.

A top controller implementing a finite state machine orchestrates the emulation. The emulation system stays in the IDLE state until the input Start rises from 0 to 1. The system then enters the RUN state. There are two sub-states in the RUN state: (1) Prior Generation (PG) and (2) Decode and Decision (DD). They are executed once per frame, and a counter COUNT keeps track of the RUN state, incrementing by 1 every time the system reaches the end of the DD state. The state transition diagram is shown in Fig. 3.

In the PG state, LLRV calculation channels in the prior generator compute sorted LLRVs of length nm along with their corresponding GF indices and store them in a dual-port prior memory. In each channel, log2(q) parallel AWGN generators produce log2(q) parallel Q-bit LLRs and send them to a multiplexer array. The multiplexer array also reads the log2(q) bits that represent a GF(q) symbol from the GF LUT; each bit is associated with one LLR. The multiplexer array selects the LLR if the associated bit is 1 and passes a 0 if the associated bit is 0. A log2(q)-input adder sums the outputs of the multiplexer array to form the symbol LLR and sends the result to a sorter of length q. It takes q cycles to complete the sorting and another nm cycles to complete the LLRV writes into the prior memory. Two channels are instantiated in our design to make full use of the two ports on prior
[Fig. 3, state transition and timing diagram (extraction residue). Recoverable labels: IDLE to RUN when Start goes from 0 to 1; RUN back to IDLE when COUNT == frame limit; COUNT++ after each PG/DD pair; PG takes Toverhead + (n/2)×(q+nm) cycles; DD runs Iteration 1, Iteration 2, …, Iteration ITER until ITER == iteration limit; each iteration processes H matrix Row 1, Row 2, …, Row m; CN RAM write, CN RAM read and FW read are scheduled over columns Col 0 … Col dc-1; 2 + LS-CN + nm cycles.]

[Design-flow figure (extraction residue; figure number not recoverable). Recoverable labels: code and decoder parameters configuration; NB-LDPC code H matrix; pre-processing; position LUT and entry LUT; configure script; elementary module library; decoder library; local routing scripts; global routing scripts; System Generator in Simulink; emulation model; Matlab; Virtex bit; Xilinx Virtex FPGA evaluation board; FER/BER and hardware utilization. Example decoder design parameters: Q = 6, nm = 16. F/B mem size: [(dc-3)×nm]×[Q+log2(q)]. Prior mem size: [n×nm]×[Q+log2(q)]. Example outputs: 3.1E-5 @ 3.2 dB, 4.1E-6 @ 3.4 dB; slice registers 22356, slice LUTs 31092.]
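The controller behaviour described in Section III-A can be illustrated with a small software model. The following Python sketch is only an illustration under stated assumptions, not the authors' hardware implementation: the names `State`, `controller_trace` and `pg_cycles` are hypothetical, `t_overhead` defaults to 0 because its value is not given, and n is assumed to be even so that the two prior-generation channels split the symbols equally. The cycle count follows the expression Toverhead + (n/2)×(q+nm) that appears in the timing diagram.

```python
from enum import Enum


class State(Enum):
    IDLE = "IDLE"
    PG = "RUN:PG"   # Prior Generation sub-state
    DD = "RUN:DD"   # Decode and Decision sub-state


def controller_trace(frame_limit, start=True):
    """List the states visited until the controller is back in IDLE."""
    trace = [State.IDLE]
    if not start:             # Start never rises: the system stays in IDLE
        return trace
    count = 0                 # COUNT increments at the end of every DD state
    while count < frame_limit:
        trace.append(State.PG)
        trace.append(State.DD)
        count += 1
    trace.append(State.IDLE)  # COUNT == frame limit
    return trace


def pg_cycles(n, q, n_m, t_overhead=0):
    """Toverhead + (n/2) * (q + nm): q sort cycles plus nm write cycles
    per symbol, with two channels working in parallel (n assumed even)."""
    return t_overhead + (n // 2) * (q + n_m)


if __name__ == "__main__":
    print([s.name for s in controller_trace(frame_limit=2)])
    print(pg_cycles(n=10, q=16, n_m=8))
```

With a frame limit of 2 the model visits PG and DD twice before returning to IDLE, and the hypothetical setting n = 10, q = 16, nm = 8 gives 5 × 24 = 120 PG cycles when the overhead is ignored.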
Resource           q = 32, nm = 8    q = 64, nm = 8    q = 16, nm = 12
Slice Registers    13,929 (15%)      16,742 (18%)      16,046 (17%)
Slice LUTs         17,210 (17%)      19,511 (19%)      17,908 (17%)
Occupied Slices    6,832 (28%)       8,744 (36%)       7,167 (30%)
BRAMs              55 (26%)          59 (28%)          53 (25%)
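The prior generation step of Section III-A (multiplexer array, adder, sorter, truncation to nm entries) can also be sketched in software. The Python fragment below is a minimal sketch under assumptions that the text does not confirm: the function name `symbol_llrv` is hypothetical, the GF LUT is assumed to map symbol index s to the plain binary representation of s, a smaller symbol LLR is assumed to mean a more reliable symbol, and the Q-bit quantization and AWGN generation are left out, so the bit LLRs are passed in directly.

```python
def symbol_llrv(bit_llrs, n_m):
    """Build a sorted, truncated LLR vector for one GF(q) symbol position.

    bit_llrs: log2(q) bit-level LLRs (one per bit of the symbol).
    n_m:      message truncation number.
    Returns a list of (symbol_llr, gf_index) pairs of length n_m.
    """
    p = len(bit_llrs)          # p = log2(q)
    q = 1 << p
    entries = []
    for symbol in range(q):
        # Multiplexer array: pass the bit LLR where the symbol bit is 1,
        # pass 0 where it is 0; the adder then sums the selected values.
        llr = sum(bit_llrs[b] for b in range(p) if (symbol >> b) & 1)
        entries.append((llr, symbol))
    # Sorter of length q (assumption: ascending order, smaller = more reliable).
    entries.sort(key=lambda e: e[0])
    # Only the first n_m entries are written to the prior memory.
    return entries[:n_m]


if __name__ == "__main__":
    for llr, gf_index in symbol_llrv([2.0, -1.0, 0.5], n_m=4):
        print(gf_index, llr)
```

For the hypothetical input above (q = 8, nm = 4), the four retained GF indices are 2, 6, 0 and 4 with symbol LLRs -1.0, -0.5, 0 and 0.5. In hardware the sort takes q cycles and the writes another nm cycles, as stated in the text; the software sort here only reproduces the result, not the timing.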