Reduced-Latency SC Polar Decoder Architectures: Chuan Zhang, Bo Yuan, and Keshab K. Parhi
Abstract—Polar codes have become one of the most favorable capacity-achieving error correction codes (ECC), owing in part to their simple encoding method. However, among the very few prior successive cancellation (SC) polar decoder designs, the required long code length makes the decoding latency high. In this paper, the conventional decoding algorithm is transformed with look-ahead techniques, which reduces the decoding latency by 50%. With pipelining and parallel processing schemes, a parallel SC polar decoder is proposed. A sub-structure sharing approach is employed to design the merged processing element (PE). Moreover, inspired by the real FFT architecture, this paper presents a novel input generating circuit (IGC) block that can generate additional input signals for merged PEs on-the-fly. Gate-level analysis has demonstrated that the proposed design achieves half the decoding latency and twice the throughput of the conventional one with similar hardware cost.

Keywords-Polar codes; successive cancellation; look-ahead; sub-structure sharing; on-the-fly.

I. INTRODUCTION

Proposed by Arıkan [1], polar codes have been considered as the first "low complexity" scheme which provably achieves the capacity for a fairly wide array of channels. However, most related research is focused on code performance rather than efficient decoder design. Among the few publications on the latter topic, [1] proposed a straightforward implementation with the successive cancellation (SC) algorithm, whose complexity is $O(N \log_2 N)$. Compared with the belief propagation (BP) algorithm [1]-[2], the SC approach is more suitable for hardware design due to its lower complexity. Several further revised SC polar decoders with complexity of $O(N)$ were presented by [3].

For these conventional polar decoders, decoding a code of length $N$ requires $2(N-1)$ clock cycles. Since modern communication systems require a code length greater than $2^{10}$, the resulting decoding delay is high. Also, restricted by the successive schedule, the highest hardware efficiency of an active stage can only be 50%. In order to achieve faster decoding and higher efficiency, the loop computation is reformulated with look-ahead techniques, which pre-calculate all possible values of the next code bit and then select the correct one with a multiplexer. This paper proposes a recursive time chart construction method which reduces the decoding latency by 50%. A parallel decoder example is proposed at gate level with VLSI-DSP techniques. A hardware-efficient merged processing element (PE) and the input generating circuit (IGC) block, which works best with the whole decoder, are also employed. Comparison results have shown that the proposed design achieves half the decoding latency and twice the throughput of the conventional one while maintaining comparable hardware complexity.

The remainder of this paper is organized as follows. A brief review of the min-sum SC decoding algorithm is provided in Section II. In Section III, the systematic algorithm to construct the look-ahead scheduling scheme is given in a recursive manner. The parallel polar decoder architectures with gate-level design details are proposed in Section IV. The performance estimation and comparison with the state-of-the-art design are presented in Section V. Section VI concludes the paper.

II. REVIEW OF MIN-SUM SC ALGORITHM

Consider an arbitrary polar code with parameter $(N, K, \mathcal{A}, u_{\mathcal{A}^c})$ [1]. We denote the input vector as $u_1^N$, which consists of a random part $u_{\mathcal{A}}$ and a frozen part $u_{\mathcal{A}^c}$. The corresponding output vector through channel $W_N$ is $y_1^N$ with conditional probability $W_N(y_1^N \mid u_1^N)$. Define the likelihood ratio (LR) as,

$$L_N^{(i)}(y_1^N, \hat{u}_1^{i-1}) \triangleq \frac{W_N^{(i)}(y_1^N, \hat{u}_1^{i-1} \mid 0)}{W_N^{(i)}(y_1^N, \hat{u}_1^{i-1} \mid 1)}. \tag{1}$$

The a posteriori decision scheme is given as follows. Here $\mathcal{A}^c$ denotes the index set of channels associated with frozen bits. LRs with even and odd indices can be generated by recursively applying Eq. (2) and (3), respectively.

A Posteriori Decision Scheme with Frozen Bits
1: if $i \in \mathcal{A}^c$ then $\hat{u}_i = u_i$;
2: else
3:   if $L_N^{(i)}(y_1^N, \hat{u}_1^{i-1}) \geq 1$ then $\hat{u}_i = 0$;
4:   else $\hat{u}_i = 1$;
5:   endif
6: endif

$$L_N^{(2i)}(y_1^N, \hat{u}_1^{2i-1}) = \left[L_{N/2}^{(i)}(y_1^{N/2}, \hat{u}_{1,o}^{2i-2} \oplus \hat{u}_{1,e}^{2i-2})\right]^{1-2\hat{u}_{2i-1}} \cdot L_{N/2}^{(i)}(y_{N/2+1}^{N}, \hat{u}_{1,e}^{2i-2}), \tag{2}$$

$$L_N^{(2i-1)}(y_1^N, \hat{u}_1^{2i-2}) = \frac{L_{N/2}^{(i)}(y_1^{N/2}, \hat{u}_{1,o}^{2i-2} \oplus \hat{u}_{1,e}^{2i-2}) \, L_{N/2}^{(i)}(y_{N/2+1}^{N}, \hat{u}_{1,e}^{2i-2}) + 1}{L_{N/2}^{(i)}(y_1^{N/2}, \hat{u}_{1,o}^{2i-2} \oplus \hat{u}_{1,e}^{2i-2}) + L_{N/2}^{(i)}(y_{N/2+1}^{N}, \hat{u}_{1,e}^{2i-2})}. \tag{3}$$
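To make the LR recursion concrete, the following is a minimal Python sketch of Eq. (2), Eq. (3), and the a posteriori decision rule for a single pair of inputs. The function names and the `frozen_value` argument are hypothetical, introduced only for illustration; `a` stands for the LR taken over $y_1^{N/2}$ and `b` for the LR taken over $y_{N/2+1}^{N}$.

```python
from typing import Optional


def lr_odd(a: float, b: float) -> float:
    """Eq. (3): LR of an odd-indexed bit from the two length-N/2 LRs."""
    return (a * b + 1.0) / (a + b)


def lr_even(a: float, b: float, u_prev: int) -> float:
    """Eq. (2): LR of an even-indexed bit, given the previous decision."""
    # The exponent 1 - 2*u_prev is +1 for u_prev = 0 and -1 for u_prev = 1.
    return a ** (1 - 2 * u_prev) * b


def decide(lr: float, frozen_value: Optional[int] = None) -> int:
    """A posteriori decision: frozen bits keep their known value."""
    if frozen_value is not None:
        return frozen_value
    return 0 if lr >= 1.0 else 1


# Example with two channel LRs (N = 2).
a, b = 4.0, 2.0
u1 = decide(lr_odd(a, b))        # LR = 9/6 = 1.5, so u1 = 0
u2 = decide(lr_even(a, b, u1))   # LR = 4 * 2 = 8, so u2 = 0
```

Note that `lr_even` has only two possible results for fixed `a` and `b`, one per value of `u_prev`.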
The decoding procedure of a polar code example with $N = 8$ is illustrated in Fig. 1, where Type I and Type II PEs are in charge of Eq. (2) and (3), respectively. The label attached to each PE indicates the clock cycle index when it is activated.

Figure 1: SC decoding process of polar codes with length N = 8.

In the logarithm domain, Eq. (2) and (3) can be rewritten as:

$$\mathcal{L}_N^{(2i)}(y_1^N, \hat{u}_1^{2i-1}) = (-1)^{\hat{u}_{2i-1}} \mathcal{L}_{N/2}^{(i)}(y_1^{N/2}, \hat{u}_{1,o}^{2i-2} \oplus \hat{u}_{1,e}^{2i-2}) + \mathcal{L}_{N/2}^{(i)}(y_{N/2+1}^{N}, \hat{u}_{1,e}^{2i-2}), \tag{4}$$

$$\mathcal{L}_N^{(2i-1)}(y_1^N, \hat{u}_1^{2i-2}) = 2\,\mathrm{artanh}\left\{\tanh\left[\tfrac{1}{2}\mathcal{L}_{N/2}^{(i)}(y_1^{N/2}, \hat{u}_{1,o}^{2i-2} \oplus \hat{u}_{1,e}^{2i-2})\right] \cdot \tanh\left[\tfrac{1}{2}\mathcal{L}_{N/2}^{(i)}(y_{N/2+1}^{N}, \hat{u}_{1,e}^{2i-2})\right]\right\}, \tag{5}$$

where the log-likelihood ratio (LLR) is defined as

$$\mathcal{L}_N^{(i)}(y_1^N, \hat{u}_1^{i-1}) \triangleq \ln L_N^{(i)}(y_1^N, \hat{u}_1^{i-1}). \tag{6}$$

Since a large look-up table (LUT) is required to implement Eq. (5), it is reduced to the min-sum update rule with a sub-optimal approximation:

$$\mathcal{L}_N^{(2i-1)}(y_1^N, \hat{u}_1^{2i-2}) \approx \mathrm{sgn}\left[\mathcal{L}_{N/2}^{(i)}(y_1^{N/2}, \hat{u}_{1,o}^{2i-2} \oplus \hat{u}_{1,e}^{2i-2})\right] \mathrm{sgn}\left[\mathcal{L}_{N/2}^{(i)}(y_{N/2+1}^{N}, \hat{u}_{1,e}^{2i-2})\right] \cdot \min\left[\left|\mathcal{L}_{N/2}^{(i)}(y_1^{N/2}, \hat{u}_{1,o}^{2i-2} \oplus \hat{u}_{1,e}^{2i-2})\right|, \left|\mathcal{L}_{N/2}^{(i)}(y_{N/2+1}^{N}, \hat{u}_{1,e}^{2i-2})\right|\right]. \tag{7}$$

Simulation results have shown that the min-sum SC algorithm, which is LUT free, can keep a balance between decoding performance and hardware efficiency [3], which is very attractive for VLSI designers. Therefore, in the following sections only the min-sum SC decoding algorithm is considered.

III. LATENCY-REDUCED UPDATING SCHEME

However, among all the algorithms stated above, probabilities are updated according to the same data flow illustrated in Fig. 1, which is straightforward but not efficient. In this section, a high-performance scheme for the polar decoder, which needs only half the number of clock cycles to obtain the estimated information bits, is developed in a recursive manner. Thorough investigation has revealed that the time chart of the straightforward SC decoding process for N-bit polar codes can be constructed recursively as follows:

Recursive Construction of Conventional Time Chart
1: initialization TC = ∅;
2: for $i = \log_2 N$ down to 1 do
3:   $j = \log_2 N - i + 1$;
4:   TC = {[j of Type I, TC], i};
5:   TC = [TC, TC];
6:   change the leftmost j of Type I with j of Type II;
7: endfor
8: output TC.

Here the notation TC = {[C, TC], s} denotes the left insertion of an array C into the previously arranged time chart TC at Stage s. Similarly, TC = [TC, TC] simply means duplicating the previous time chart to obtain the new one. $i$ and $j$ are iterative execution indices. "j of Type I" is short for "j copy/copies of Type I PE(s) is/are activated in that clock cycle". The corresponding time chart is illustrated in Fig. 2 (a). Since Stage $i$ is activated $2^i$ times during the whole decoding process, the total number of clock cycles required is,

$$2 \sum_{i=0}^{\log_2 N - 1} 2^i = 2 \cdot \frac{2^{\log_2 N} - 1}{2 - 1} = 2(N - 1). \tag{8}$$
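The cycle count in Eq. (8) can be checked with a short Python sketch. It is a plain recursive traversal of the SC data flow rather than the TC insertion procedure listed above, and it is assumed here to yield the same activation order; the function name and the tuple layout are hypothetical.

```python
def conventional_schedule(n_stages: int, stage: int = 1) -> list:
    """Return the (stage, PE type) pair activated in each clock cycle."""
    if stage > n_stages:
        return []
    inner = conventional_schedule(n_stages, stage + 1)
    # Type II (Eq. (3)) runs first. Type I (Eq. (2)) has to wait for the
    # bit decisions produced by the deeper stages in between.
    return [(stage, "II")] + inner + [(stage, "I")] + inner


chart = conventional_schedule(3)   # N = 8, three stages
assert len(chart) == 2 * (8 - 1)   # 14 clock cycles, as in Eq. (8)
```

Each stage appears twice per visit, once for each PE type, which is where the factor of 2 in Eq. (8) comes from.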
[Figure 2 (a): conventional time chart over clock cycles 1 to 14, producing outputs 1 to 8, i.e. the LLRs $\mathcal{L}_8^{(1)}(y_1^8)$ through $\mathcal{L}_8^{(8)}(y_1^8, \hat{u}_1^7)$.]
[Figure 2 (b): look-ahead time chart over clock cycles 1 to 7, producing the LLRs in pairs: $\mathcal{L}_8^{(1)}(y_1^8)$ with $\mathcal{L}_8^{(2)}(y_1^8, \hat{u}_1)$; $\mathcal{L}_8^{(3)}(y_1^8, \hat{u}_1^2)$ with $\mathcal{L}_8^{(4)}(y_1^8, \hat{u}_1^3)$; $\mathcal{L}_8^{(5)}(y_1^8, \hat{u}_1^4)$ with $\mathcal{L}_8^{(6)}(y_1^8, \hat{u}_1^5)$; $\mathcal{L}_8^{(7)}(y_1^8, \hat{u}_1^6)$ with $\mathcal{L}_8^{(8)}(y_1^8, \hat{u}_1^7)$.]
Figure 2: Conventional and look-ahead decoding time charts for polar codes with N = 8.
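The look-ahead chart of Fig. 2 (b) can be mimicked with a small Python sketch in which every stage is activated once per visit, because the Type I candidates are computed in the same cycle as the Type II result. The function name is hypothetical, and the number of active PEs per stage ($N/2^s$ for Stage $s$) is an assumption that agrees with the counts listed in Table I.

```python
def look_ahead_schedule(n_stages: int, stage: int = 1) -> list:
    """Return (stage, number of active PEs) for each clock cycle."""
    active = 2 ** (n_stages - stage)   # assumed N / 2**stage PEs per stage
    if stage == n_stages:
        return [(stage, active)]
    inner = look_ahead_schedule(n_stages, stage + 1)
    # One activation per visit: both Type I candidates are pre-computed
    # alongside the Type II output, so no second pass over this stage.
    return [(stage, active)] + inner + inner


chart = look_ahead_schedule(3)     # N = 8
assert len(chart) == 8 - 1         # 7 clock cycles
assert [pes for _, pes in chart] == [4, 2, 1, 1, 2, 1, 1]
```

For $\log_2 N$ stages the list has $2^{\log_2 N} - 1 = N - 1$ entries, half of the $2(N-1)$ cycles of the conventional chart.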
Figure 3: Pipelined decoder architectures of polar codes with length N = 8.

However, as mentioned previously, the conventional decoding approach is not suitable for real-time communication systems for two reasons. First, in order to achieve the required performance, the code length $N$ is usually set between $2^{10}$ and $2^{20}$. An immediate consequence is that the latency of $2(N-1)$ clock cycles is too large. Second, it is apparent that during the whole decoding process the highest hardware utilization in a specific clock cycle is only 50% (Clock cycle 1). As the stage index increases, the hardware efficiency will go down as low as $1/N$ (Clock cycle $\log_2 N$), which can be lower than $2^{-10}$ for practical applications. Even for the pipelined tree architecture proposed by [3] in Fig. 3, the highest utilization is only 50% as well, which means half of the PEs are idle during each clock cycle. This dilemma is caused by the sequential decoding property of the SC algorithm.

It is noted that if both LLR inputs for Eq. (4) are available, there can be only two possible outputs, depending on what value $\hat{u}_{2i-1}$ will take. Therefore, for a Type I PE, given both deterministic inputs, the look-ahead scheme only needs to pre-compute two output candidates, one of which is selected by a multiplexer thereafter. For instance, as shown in Fig. 1, all possible outputs of Type I PEs labeled by 8 in Stage 1 can be pre-calculated in Clock cycle 1. In other words, for Stage 1 the required computation in Clock cycle 8 can be incorporated into Clock cycle 1. In a similar way, for Stage 2 the computation in Clock cycles 5 and 12 can be taken care of in Clock cycles 2 and 9, respectively. Calculation in Clock cycles 4, 7, 11, and 14 can be re-scheduled into Clock cycles 3, 6, 10, and 13 for Stage 3. As a result, only half the clock cycles are required to implement the same decoding task with the help of the proposed look-ahead schedule.

For the 8-bit polar decoder example shown in Fig. 2 (b), all PEs at Stage 1 are activated during Clock cycle 1 because both deterministic LLR inputs for each PE are guaranteed by channel outputs. However, in Clock cycle 2, only PEs labeled with 2 or 5 can be activated, because they are the only ones with deterministic inputs. For PEs with labels of 9 or 12, their inputs are generated by Type I PEs in Stage 1, which have two possible values at this moment. In order to avoid error propagation caused by pre-computing to the next stage, those PEs stay idle during Clock cycle 2. Similar schemes apply to further decoding processes. It is clear that the required number of clock cycles can be halved to $N-1$. The time chart construction of the proposed scheme is given as follows:

Recursive Construction of Look-Ahead Time Chart
1: initialization TC = ∅;
2: for $i = \log_2 N$ down to 1 do
3:   $j = \log_2 N - i + 1$;
4:   TC = {[j of Type I & II, TC], i};
5:   if $i = 1$ then
6:     break;
7:   endif
8:   TC = [TC, TC];
9: endfor
10: output TC.

As indicated by Step 4, both types of PEs can work simultaneously in the same clock cycle, which not only shortens the decoding latency by 50% but also doubles the hardware efficiency. Moreover, the proposed approach leads to a recursive construction method. To clarify the Russian Doll-like relationship between stages, the conventional and look-ahead construction processes are marked with arrows in Fig. 2.

IV. ARCHITECTURES FOR LOOK-AHEAD DECODER

A. Design of Type I PE

According to the look-ahead scheme, the Type I PE is in charge of pre-computing two possible outputs in parallel, which makes it in fact an adder-subtractor. Suppose $X$ and $Y$ are two operands, and $Z_{in}$ is the carry-in or borrow-in bit. For the full adder, the sum and carry-out bits are represented by $S$ and $C_{out}$. The difference and borrow-out produced by the full subtractor are denoted by $D$ and $B_{out}$. The logic equations are as follows:

$$S = X \oplus Y \oplus Z_{in}; \tag{9}$$
$$C_{out} = X \cdot Y + (X \oplus Y) \cdot Z_{in}; \tag{10}$$
$$D = X \oplus Y \oplus Z_{in}; \tag{11}$$
$$B_{out} = \overline{X} \cdot Y + \overline{X \oplus Y} \cdot Z_{in}. \tag{12}$$

[Figure 4 (a): 1-bit full adder-subtractor, with ports X, Y, Cin, Bin, S, D, Cout, Bout.]
[Figure 4 (b): 1-bit half adder-subtractor, with ports X, Y, S, D, Cout, Bout.]
Figure 4: Proposed 1-bit adder-subtractor architectures.
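The shared terms in Eq. (9) to (12) can be checked with a minimal bit-level Python sketch. The function name is hypothetical; the complemented terms in the borrow-out follow the standard full-subtractor logic, and the single `z_in` input mirrors the $Z_{in}$ of the equations rather than the separate carry and borrow ports of a hardware cell.

```python
def full_adder_subtractor(x: int, y: int, z_in: int):
    """Return (S, C_out, D, B_out) for 1-bit inputs, per Eq. (9)-(12)."""
    xor_xy = x ^ y                     # term shared by all four outputs
    s = xor_xy ^ z_in                  # Eq. (9); Eq. (11) is the same signal
    c_out = (x & y) | (xor_xy & z_in)                 # Eq. (10)
    b_out = ((1 - x) & y) | ((1 - xor_xy) & z_in)     # Eq. (12)
    return s, c_out, s, b_out


# Exhaustive check against integer arithmetic.
for x in (0, 1):
    for y in (0, 1):
        for z in (0, 1):
            s, c_out, d, b_out = full_adder_subtractor(x, y, z)
            assert x + y + z == s + 2 * c_out    # addition
            assert x - y - z == d - 2 * b_out    # subtraction
```

Only one XOR of the operands is evaluated for both the sum and the difference paths, which is the kind of gate sharing the section relies on.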
Notice that $S$ and $D$ are actually the same, and $\overline{X} \cdot Y$ is an intermediate term of $X \oplus Y$. $\overline{X \oplus Y} \cdot Z_{in}$ is also a byproduct of $X \oplus Y \oplus Z_{in}$. The resulting gate-sharing scheme not only implements parallel processing but also reduces the hardware consumption. The gate-level structures of the 1-bit full and half adder-subtractor are depicted in Fig. 4 (a) and (b), respectively. The proposed q-bit adder-subtractor, which is illustrated in Fig. 5, requires less than 57% of the hardware of the conventional one while achieving the same performance.

Figure 5: Proposed Type I PE architectures.

B. Design of Type II PE

The Type II PE with the min-sum algorithm is shown in Fig. 6.

[Figure 6 blocks: sgn, mag, TtoS, StoT, CMP, with q-bit data paths. input 1: the LLR $\mathcal{L}_{N/2}^{(i)}(y_{N/2+1}^{N}, \hat{u}_{1,e}^{2i-2})$; input 2: the LLR $\mathcal{L}_{N/2}^{(i)}(y_1^{N/2}, \hat{u}_{1,o}^{2i-2} \oplus \hat{u}_{1,e}^{2i-2})$; output: the LLR $\mathcal{L}_N^{(2i-1)}(y_1^N, \hat{u}_1^{2i-2})$.]
Figure 6: Proposed architectures of Type II PE.

C. Design of Merged PEs

Since the comparator in the Type II PE is actually a q-bit subtractor, which is also employed by the Type I PE, it is possible to merge Type I and Type II PEs together with the sub-structure sharing scheme. The detailed structure is as follows:

[Figure: merged PE with ports input 1, input 2, output 1, output 2, and output 3. Caption not recoverable.]

... is possible to generate the required $\hat{u}_{2i-1}$ using the real FFT-like signal flow [4]. For instance, all $\hat{u}_{2i-1}$ for the 8-bit polar decoder can be generated with the in-place procedure in Fig. 8.

[Figure 8 labels: inputs $\hat{u}_1$ to $\hat{u}_6$; intermediate sums $\hat{u}_1 + \hat{u}_2$, $\hat{u}_2$, $\hat{u}_3 + \hat{u}_4$, $\hat{u}_4$, $\hat{u}_5 + \hat{u}_6$, $\hat{u}_6$; final sums $\hat{u}_1 + \hat{u}_2 + \hat{u}_3 + \hat{u}_4$, $\hat{u}_2 + \hat{u}_4$, $\hat{u}_3 + \hat{u}_4$, $\hat{u}_4$.]
Figure 8: Flow graph of the proposed IGC.

The pipelined architecture of the flow graph is illustrated in Fig. 9, where $U_i$ denotes the unit which consists of $i$ stage(s):

[Figure 9 labels: Stage 1 and Stage 2, units $U_1$ and $U_2$, delay elements D, control signal $c_1$, signals $\hat{u}_{2i-1}$ and $\hat{u}_{2i}$.]
Figure 9: Pipelined architecture for the flow graph in Fig. 8.

In general, for an N-bit decoder, since the data structures of the IGC are defined recursively for powers of 2, the pipelined architecture can be constructed with the recurrence relationship. The recursion for the general case is shown in Fig. 10, where module $U_n$ can be constructed based on module $U_{n-1}$ and $N/4$ extra XOR-pass elements. For efficient design, memory banks are employed instead of flip-flops. Here, $n = \log_2 N - 1$. Control signal $c_n$ can be obtained by down-sampling $c_1$ by $n$.

[Figure: decoder structure built from merged PEs with ports I1, I2, O1, O2, O3, fed by channel LLR pairs such as $\mathcal{L}_1^{(1)}(y_2)$, $\mathcal{L}_1^{(1)}(y_{10})$ and by the sums $\hat{u}_1 + \hat{u}_2 + \hat{u}_3 + \hat{u}_4$, $\hat{u}_1 + \hat{u}_2$ or $\hat{u}_5 + \hat{u}_6$, $\hat{u}_3 + \hat{u}_4$, $\hat{u}_2 + \hat{u}_4$, $\hat{u}_2$ or $\hat{u}_6$. Caption not recoverable.]
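The sums of decided bits that appear in the IGC flow graph can be reproduced in software with in-place XOR butterflies. The Python sketch below is only assumed to correspond to the signal flow of Fig. 8; it models neither the pipelining nor the memory banks, and the function name is hypothetical.

```python
def partial_sums(u: list) -> list:
    """In-place XOR butterflies over decided bits (length a power of 2)."""
    x = list(u)
    n = len(x)
    step = 1
    while step < n:
        for start in range(0, n, 2 * step):
            for k in range(start, start + step):
                # Upper branch takes the XOR, lower branch passes through.
                x[k] ^= x[k + step]
        step *= 2
    return x


# For (u1, u2, u3, u4) the result is
# (u1+u2+u3+u4, u2+u4, u3+u4, u4), with addition modulo 2.
assert partial_sums([1, 0, 1, 1]) == [1, 1, 0, 1]
```

After the first pass (`step = 1`) the array holds $\hat{u}_1 + \hat{u}_2$, $\hat{u}_2$, $\hat{u}_3 + \hat{u}_4$, $\hat{u}_4$, which are the intermediate sums needed by the later stages.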
TABLE I NUMBER OF ACTIVE MERGED PES IN EACH CLOCK CYCLE
Input | Clock cycle: 1 2 3 4 5 6 7 8
C1 | 4 2 1 1 2 1 1
C2 | 4 2 1 1 2 1 1

V. COMPARISON OF LATENCY AND HARDWARE

In this section, the proposed polar decoder is compared with the state-of-the-art reference. For the sake of fairness, both decoders have the same number of PEs. Since [3] does not provide details of the $\hat{u}_s$ computation block, the counterpart of the IGC, only the remaining blocks are compared.

TABLE II COMPARISON OF DIFFERENT POLAR DECODERS
Different designs | Proposed design | Line design [3]
Hardware consumption (q-bit quantization)
# of Merged PEs | N/2 | N/2
1 PE: XOR | 9q | 11q-3
1 PE: REG | 0 | 1
1 PE: MUX | 6q | 5q
# of IGCs | 2 | ––
1 IGC: XOR | N/2-1 | ––
1 IGC: RAM | N/2-2 | ––
1 IGC: MUX | N/2-2 | ––
# of other REGs | q(9N/2+4) | q(N-1)
# of other MUXs | q(N+2) | 3q(N/2-1)
Total: XOR† | ~17qN/2 | ~(19q-3)N/2
Total: REG | ~9qN/2 | ~(q+1/2)N
Decoding schedule
Latency | N | 2(N-1)
Normalized throughput | 2 | 1
†MUX is converted to XOR with the standard proposed in [5].

According to Table II, the proposed design requires only half the latency of the reference while achieving twice the throughput, with a similar amount of hardware. Further discussion can show that the look-ahead approach is suitable for other SC decoders.

VI. CONCLUSION

A novel look-ahead SC decoding schedule for polar codes is proposed in this paper, which can halve the decoding latency required by conventional approaches. For efficient hardware implementation, a merged PE and an IGC block are presented. Compared with its conventional counterpart, the parallel decoder example can halve the decoding latency and double the throughput with similar hardware consumption.

REFERENCES

[1] E. Arıkan, “Channel polarization: a method for constructing capacity-achieving codes for symmetric binary-input memoryless channels,” IEEE Trans. on Inf. Theory, vol. 55, no. 7, pp. 3051-3073, July 2009.
[2] E. Arıkan, “A performance comparison of polar codes and Reed-Muller codes,” IEEE Commun. Lett., vol. 12, no. 6, pp. 447-449, June 2008.
[3] C. Leroux, I. Tal, A. Vardy, and W. J. Gross, “Hardware architectures for successive cancellation decoding of polar codes,” in Proc. Int. Conf. Acoust., Speech, and Sig. Proc. (ICASSP), pp. 1665-1668, May 2011.
[4] M. Garrido, K. K. Parhi, and J. Grajal, “A pipelined FFT architecture for real-valued signals,” IEEE Trans. Circuits Syst. I: Reg. Papers, vol. 56, no. 12, pp. 2634-2643, Dec. 2009.
[5] X. Zhang and F. Cai, “Efficient partial-parallel decoder architecture for quasi-cyclic nonbinary LDPC codes,” IEEE Trans. Circuits Syst. I: Reg. Papers, vol. 58, no. 2, pp. 402-414, Feb. 2011.