An FPGA Implementation of Successive Cancellation List Decoding for Polar Codes
a thesis submitted to
the graduate school of engineering and science
of bilkent university
in partial fulfillment of the requirements for
the degree of
master of science
in
electrical and electronics engineering
By
Altuğ Süral
January 2016
We certify that we have read this thesis and that in our opinion it is fully adequate,
in scope and in quality, as a thesis for the degree of Master of Science.
Erdal Arıkan (Advisor)
Orhan Arıkan
Levent Onural
Director of the Graduate School
ABSTRACT
AN FPGA IMPLEMENTATION OF SUCCESSIVE
CANCELLATION LIST DECODING FOR POLAR
CODES
Altuğ Süral
M.S. in Electrical and Electronics Engineering
Advisor: Erdal Arıkan
January 2016
Polar codes are the first error correction codes that asymptotically and provably achieve the capacity of binary discrete memoryless symmetric channels under low complexity successive cancellation (SC) decoding. Although SC is a low complexity algorithm, it does not perform as well as a maximum-likelihood (ML) decoder unless a sufficiently large code block length is used. SC is a soft-decision decoding algorithm that employs a depth-first search with a divide-and-conquer approach to find a good estimate of the decision vector. Using SC with a list (SCL) improves the performance of the SC decoder, providing near-ML performance. The SCL decoder employs the beam search method, a greedy algorithm, to approach ML performance without considering all possible codewords. Even the ML performance of polar codes is limited by the minimum Hamming distance of the codewords. To increase the minimum distance, cyclic redundancy check aided (CRC-SCL) decoding can be used. This algorithm makes polar codes competitive with state-of-the-art codes by trading complexity for performance. In this thesis, we present an FPGA implementation of an adaptive list decoder, consisting of SC, SCL and CRC decoders, to meet the tradeoff between performance and complexity.
ÖZET (ABSTRACT IN TURKISH)
FPGA IMPLEMENTATION OF THE SUCCESSIVE CANCELLATION LIST POLAR DECODER
Altuğ Süral
M.S. in Electrical and Electronics Engineering
Thesis Advisor: Erdal Arıkan
January 2016
Acknowledgement
I would like to thank my supervisor, Prof. Erdal Arıkan for his persistent support,
invaluable guidance, encouragement and endless patience during my thesis.
I express deep and sincere gratitude to Prof. Orhan Arıkan and Dr. Ali Ziya
Alkar for their valuable suggestions and kindness.
I also thank Bilkent University for providing me with an essential opportunity and a sophisticated research environment.
It is my privilege to have a supportive and lovely mother, Defne Süral. Without her support, I could not have completed my thesis.
I am extremely lucky to be with Gökçe Tuncer, who has a big heart and an agile mind. I would like to thank her for her ideas that improved my thesis.
Contents
1 Introduction 1
1.1 What are Polar Codes? . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Summary of Main Results . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Outline of Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 Polar Codes 6
2.1 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3 Channel Polarization . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3.1 Channel Combining . . . . . . . . . . . . . . . . . . . . . . 8
2.3.2 Channel Splitting . . . . . . . . . . . . . . . . . . . . . . . 9
2.3.3 Code Construction . . . . . . . . . . . . . . . . . . . . . . 10
2.4 Encoding of Polar Codes . . . . . . . . . . . . . . . . . . . . . . . 11
2.5 Successive Cancellation (SC) Decoding of Polar Codes . . . . . . . 13
2.5.1 Successive Cancellation Decoding of Polar Codes . . . . . 16
2.5.2 Successive Cancellation List (SCL) Decoding of Polar Codes 20
2.5.3 Adaptive Successive Cancellation List Decoding of Polar
Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.6 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.6.1 Comparison between Floating-point and Fixed-point Sim-
ulations of the SC Decoder . . . . . . . . . . . . . . . . . . 24
2.6.2 Performance Loss due to Min-sum Approximations in the
SC Decoder . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.6.3 Fixed-point Simulations of the SCL Decoder . . . . . . . . 25
2.6.4 Fixed-point Simulations of the Adaptive SCL Decoder . . 28
4 Conclusion 66
List of Figures
List of Abbreviations
BS bitonic sorter.
CC clock cycle.
CL control logic.
DC decoding cycle.
DU decision unit.
FF flip-flop.
LL log-likelihood.
LR likelihood ratio.
ML maximum likelihood.
PE processing element.
PU processing unit.
SC successive cancellation.
SM sign-magnitude.
TC two's complement.
L list size.
R code rate.
Chapter 1
Introduction
Shannon defines channel capacity as the maximum rate of information that can be reliably transmitted over a communication channel [1]. He also shows that the channel capacity can be achieved by a random code construction method. With a code rate smaller than the channel capacity, a communication system encounters negligible errors. It has always been a challenge to achieve channel capacity
with low complexity algorithms. Polar coding is a method that achieves channel
capacity with low complexity encoding and decoding.
Polar codes are a class of capacity-achieving linear forward error correction (FEC) block codes [2]. The complexity of both encoding and successive cancellation (SC) decoding of polar codes is O(N log N), where N is the code block length. The recursive construction of both the encoder and the SC decoder enables neat processing structures, component sharing and efficient utilization of limited resources. Polar codes allow a flexible selection of the code rate with 1/N precision, so that an arbitrary code rate can be used without reconstructing the code. Polar codes are channel-specific codes, which means a polar code designed for a particular channel might not perform well on other channels. The important properties of polar codes are:

• Adjustable code rate with 1/N precision without regenerating the code.
Decoding of polar codes is an active research problem. There are several decoding methods in the literature, such as SC [2], SCL [4], SC stack [5] and belief propagation [6]. The scope of this thesis covers the SC and SCL methods. Thanks to low complexity encoding and decoding, implementation of polar codes at long block lengths is feasible for practical communication systems. The main concern for polar codes at long block lengths, however, is the decoding latency caused by strong data dependencies. In this thesis, we consider moderate code block lengths such as 1024 and 256 in order to enhance finite-length performance with limited latency and resource usage. Although the SC decoder asymptotically achieves channel capacity as N increases, its performance degrades at short and moderate code block lengths due to incomplete polarization. For this reason, the SC decoder does not perform as well as a maximum likelihood (ML) decoder at short and moderate block lengths. To overcome this issue, the algorithm must be extended, for example by tracking multiple candidate decision paths instead of the single path that the SC decoder follows. This is exactly what the SCL decoding algorithm does [4]. It uses the beam search method to explore the possible decoding paths efficiently, and can be considered a greedy algorithm that approaches ML performance for a sufficiently large list size L. Since considering all 2^(NR) possible decoding paths is impractical and too complex, the SCL algorithm restricts the complexity by tracing at most the L best paths. In this way, the algorithm operates with O(L N log N) computational complexity. The error correction performance is further improved by combining the SCL algorithm with a cyclic redundancy check (CRC) code. At the end of decoding, the SCL decoder selects a CRC-valid path from among the L surviving paths.
Polar codes are proven to achieve channel capacity under the SC decoding algorithm for symmetric binary discrete memoryless channels (B-DMCs) [2]. Due to the sequential nature of the SC decoding algorithm, the hardware implementation is significantly challenging. In this thesis, we address this issue by dividing the algorithm into simpler modules. The SC algorithm provides low complexity O(N log N) decoding; however, it does not perform as well as an ML decoder at short and moderate code block lengths. The performance of the SC algorithm can be improved by the SCL decoding algorithm, which tracks the L best decoding paths together. The performance can be further improved by adding a CRC to SCL decoding: a CRC-valid path is selected from among the L best decoding paths at the end of decoding. However, the SCL decoding algorithm suffers from long latency and low throughput due to its high complexity, O(L N log N), as L and N increase. The throughput of the SCL decoder can be improved by using an adaptive decoder, which provides the SC throughput with the SCL performance. The data flow of the adaptive decoder is shown in Figure 1.1.
The adaptive SCL decoder has three main components: SC, SCL and CRC decoders. Initially, the SC decoder is activated and a hard decision estimate vector is calculated. After that, the CRC decoder checks whether the hard decision vector is correct. If the CRC is valid, the hard decision vector is quite likely to be the correct information vector; in this case, the adaptive decoder terminates immediately without activating the SCL decoder. Otherwise, when the CRC is invalid, the SCL decoder is activated and L information vector candidates are generated. Among these candidates, the CRC decoder selects one with a valid CRC vector. If more than one candidate has a valid CRC, the most probable of them is selected. Lastly, when none of the candidates has a valid CRC vector, the CRC decoder selects the most probable decision estimate vector to reduce the BER.

Figure 1.1: Data flow of the adaptive decoder.
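The control flow described above can be sketched as follows. This is a minimal illustration, not the FPGA design: `sc_decode`, `scl_decode` and `crc_ok` are hypothetical stand-ins for the actual decoder and CRC modules.

```python
def adaptive_decode(llrs, L_max, sc_decode, scl_decode, crc_ok):
    """Sketch of the adaptive decoder flow: try the fast SC decoder first,
    and fall back to SCL only when the CRC of the SC output fails."""
    u_hat = sc_decode(llrs)               # single-path SC decoding
    if crc_ok(u_hat):
        return u_hat                      # SC result accepted, SCL skipped
    # SCL returns L candidate vectors, ordered from most to least probable
    candidates = scl_decode(llrs, L_max)
    for cand in candidates:
        if crc_ok(cand):                  # first CRC-valid path wins
            return cand
    return candidates[0]                  # no valid CRC: most probable path
```

Because the SC output passes the CRC at most operating points, the SCL decoder is activated only rarely, which is what recovers the SC throughput.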
The effect of the list size on the frame error rate (FER) performance of polar codes is shown in Figure 1.2. In this simulation, binary phase shift keying (BPSK) modulated symbols are transmitted over the binary additive white Gaussian noise channel (BAWGNC). There is a significant performance gain of more than 1 dB between the SC decoder and the SCL decoder. As the list size increases, the performance improves; however, the rate of improvement decreases.
Figure 1.2: FER performance of the SCL decoder, N = 1024, K = 512.
Chapter 2
Polar Codes
2.1 Notations
Uppercase italic letters, such as X and Y, denote random variables, and their realizations are denoted by lowercase italic letters (e.g., x, y). A length-N row vector is denoted by u_1^N and its sub-vector (u_i, u_{i+1}, ..., u_j) is denoted by u_i^j. Uppercase calligraphic symbols denote sets (e.g., 𝒳, 𝒴). The time complexity of an algorithm is denoted by Υ and the space complexity by ζ. The base-2 logarithm and the natural logarithm are represented by log(·) and ln(·), respectively. The logarithmic likelihood information is represented by δ0 for ln(W(y|x = 0)) and δ1 for ln(W(y|x = 1)), where x denotes the encoder output and y denotes the output of the channel W. The difference δ0 − δ1 is called the log-likelihood ratio (LLR) and is represented by λ.
2.2 Preliminaries
Note that the BSC(p), the BEC() and the BAWGNC(σ) are all memoryless.
I(W) ≜ Σ_{y∈𝒴} Σ_{x∈𝒳} (1/2) W(y|x) log [ W(y|x) / ( (1/2)W(y|0) + (1/2)W(y|1) ) ].   (2.2)
Note that the symmetric capacity is a measure of achievable rate. For symmetric channels, I(W) equals the Shannon capacity, which is the upper bound on the code rate for reliable communication.
Definition 6. The Kronecker product of an m × n matrix A and a p × q matrix B is the (mp) × (nq) block matrix C defined as

C ≜ A ⊗ B = [ A_11 B ⋯ A_1n B ; ⋮ ⋱ ⋮ ; A_m1 B ⋯ A_mn B ].   (2.4)
The channel polarization operation creates N synthetic channels {W_N^{(i)} : 1 ≤ i ≤ N} from N independent copies of the B-DMC W [2]. The polarization phenomenon drives the symmetric capacities I(W_N^{(i)}) of these synthetic channels towards 0 or 1, such that I(W_N^{(i)}) ≈ 0 implies that the ith channel is completely noisy and I(W_N^{(i)}) ≈ 1 implies that the ith channel is almost perfectly noiseless. This capacity separation makes it possible to send information (free) bits through the noiseless channels and redundancy (frozen) bits through the noisy channels.
Let A be the information set and A^c be the frozen set. The input vector u_1^N consists of both information bits u_A and frozen bits u_{A^c}, such that u_A ∈ 𝒳^K and u_{A^c} ∈ 𝒳^{N−K}.
[Figure 2.1: Channel combining of two independent copies of W into W_2.]

A combined B-DMC W_2 is split back into two channels W_2^{(↑)} and W_2^{(↓)} by the channel
splitting operation. The transition probabilities of these channels are
W_2^{(↑)}(y_1^2 | u_1) = (1/2) Σ_{u_2∈{0,1}} W(y_1 | u_1 ⊕ u_2) W(y_2 | u_2)   (2.6)

W_2^{(↓)}(y_1^2, u_1 | u_2) = (1/2) W(y_1 | u_1 ⊕ u_2) W(y_2 | u_2).   (2.7)
The transition probabilities are calculated in consecutive order from the top
splitting operation to the bottom splitting operation, because the decision bit u1
must be known before the bottom splitting operation.
2.3.3 Code Construction
The aim of polar code construction is to determine the sets A and A^c according to the capacities of the individual channels. Since polar codes are channel-specific codes, the code construction may differ from channel to channel. Channel parameters, such as σ for the BAWGNC and ε for the binary erasure channel (BEC), are inputs to a code construction method. For a BEC W, the code construction for an (N = 8, K = 4) polar code with erasure probability ε = 0.3 is shown in Figure 2.2.
Initially, the reliability of the smallest channel W_1^{(1)} is set as ε = 0.3. After that, the reliability of the first length-2 channel W_2^{(1)} is calculated as Z(W_2^{(1)}) = 2Z(W_1^{(1)}) − Z(W_1^{(1)})^2, where Z(W_N^{(i)}) is the erasure probability of the ith length-N channel counting from the top. At the same time, the second length-2 channel can be calculated as Z(W_2^{(2)}) = Z(W_1^{(1)})^2. In general, the recursive formulas for calculating the top and bottom channels are

Z(W_{2N}^{(2i−1)}) = 2Z(W_N^{(i)}) − Z(W_N^{(i)})^2,
Z(W_{2N}^{(2i)}) = Z(W_N^{(i)})^2.
At the end of stage log N (in this case log N = 3), the erasure probabilities of all length-N channels are available. At this point, the K channels with the lowest erasure probabilities are set as free and the others are set as frozen. The algorithm has log N stages and performs N − 2 calculations. Polar code construction for a symmetric B-DMC can be performed with several methods, such as Monte-Carlo simulation [2], density evolution [7] and Gaussian approximation [8]. In this thesis, we use the Monte-Carlo simulation method with ten million trials to determine the A and A^c sets.
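For the BEC, the erasure probability recursion above can be implemented directly. The following is a minimal sketch (function and variable names are illustrative, not from the thesis):

```python
def bec_construct(n_stages, eps, K):
    """Polar code construction for BEC(eps): evolve the erasure
    probabilities Z through log N stages, then pick the K most
    reliable synthetic channels as the free set A."""
    z = [eps]                              # Z(W_1^(1)) = eps
    for _ in range(n_stages):              # each stage doubles the list
        nxt = []
        for zi in z:
            nxt.append(2 * zi - zi * zi)   # top (worse) channel
            nxt.append(zi * zi)            # bottom (better) channel
        z = nxt
    # free set A: indices (0-based) of the K smallest erasure probabilities
    order = sorted(range(len(z)), key=lambda i: z[i])
    A = sorted(order[:K])
    return z, A
```

For n_stages = 3, eps = 0.3 and K = 4, this reproduces the (N = 8, K = 4) example above: the four most reliable channels are the ones at the bottom of the recursion tree.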
Polar codes can be encoded by a simple linear mapping. For code block length N, the generator matrix G_N is defined as G_N = B_N F^{⊗n} for any N = 2^n with n ≥ 1, where B_N is a bit-reversal permutation matrix and F^{⊗n} is the nth Kronecker power of the matrix

F = [ 1 0 ; 1 1 ].

As a result, the following equations are obtained at the output of the 8-bit encoder.
[Figure 2.3: Factor graph of the 8-bit polar encoder; inputs u1, u5, u3, u7, u2, u6, u4, u8 (bit-reversed order) map to outputs x1, ..., x8.]
x1 = u1 ⊕ u2 ⊕ u3 ⊕ u4 ⊕ u5 ⊕ u6 ⊕ u7 ⊕ u8
x2 = u5 ⊕ u6 ⊕ u7 ⊕ u8
x3 = u3 ⊕ u4 ⊕ u7 ⊕ u8
x4 = u7 ⊕ u8
(2.10)
x5 = u2 ⊕ u4 ⊕ u6 ⊕ u8
x6 = u6 ⊕ u8
x7 = u4 ⊕ u8
x8 = u8
The factor graph representation (Fig. 2.3) shows that the 8-bit encoder includes twelve XOR operations. In general, an N-bit encoder includes (N/2) log N XOR operations. Let Υ_E(N) denote the time complexity of encoding. Due to the recursive channel combining, a length-N encoder consists of two length-N/2 encoders and N/2 binary XOR operations. Therefore, Υ_E(N) is
Υ_E(N) = 2 Υ_E(N/2) + N/2        (2.11)
       = 2 Υ_E(N/2) + Θ(N)       (2.12)
       = O(N log N).             (2.13)
The free bits can be made observable at the output of a polar encoder by systematic encoding of polar codes [9].
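A direct way to check equations (2.10) is to implement the non-systematic encoder as the bit-reversal permutation B_N followed by the XOR butterfly network of the factor graph. A minimal sketch (not the FPGA implementation):

```python
def bit_reverse(i, n_bits):
    """Reverse the n_bits-bit binary representation of index i."""
    r = 0
    for _ in range(n_bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def polar_encode(u):
    """Non-systematic polar encoding x = u B_N F^{(x)n} over GF(2):
    bit-reversal permutation followed by the in-place XOR butterfly."""
    n = len(u)
    n_bits = n.bit_length() - 1
    x = [u[bit_reverse(i, n_bits)] for i in range(n)]  # apply u B_N
    half = 1
    while half < n:                      # log N butterfly stages
        for start in range(0, n, 2 * half):
            for j in range(start, start + half):
                x[j] ^= x[j + half]      # upper output = a XOR b
        half *= 2
    return x
```

Feeding in unit vectors reproduces the columns of G_8 listed in (2.10); for example, u8 = 1 alone produces the all-ones codeword, matching the fact that u8 appears in every equation.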
SC decoding of polar codes is a search problem [10] in which the target is to reconstruct the information data from the noisy channel output. The search space consists of all possible codewords belonging to the code space C(N, 2^⌊K⌋), where K = N R. The decoding process can be viewed as a reduced binary tree with 2^K leaves and depth N. The frozen bits cause the reduction in this binary tree, because there is only one decision option at the ith frozen decision step, û_i = u_i for i ∉ A.

Calculating the likelihoods of all possible codewords and finding the most probable one is the most direct method that minimizes the block error probability, defined as P_e = P{û_A ≠ u_A}. This method is called ML decoding and uses the British Museum procedure of searching all possible codewords. Although ML decoding provides excellent error correction performance, its exponential complexity makes it impractical to implement.
Furthermore, instead of searching all possible codewords, following only the current best decoding path significantly reduces the decoding complexity. In this case, the depth-first (hill climbing) search method is useful. At each decision step, there are two decision candidates: û_i = 0 and û_i = 1. The decision between these two candidates is made with respect to the channel information and all previously decoded bits. Due to the information gained at each decoding step, the depth-first search becomes a hill climbing search and the decoder is not allowed to revise its previous decisions in light of the current information bits. Although this restriction reduces the error correction performance of the SC decoder, especially in difficult terrains (low signal-to-noise ratio (SNR) values), the decoder has a reasonable O(N log N) complexity. The performance loss is caused by locally best decoding paths, which misguide the decoder onto an incorrect decoding path. The original low complexity SC decoder [2] uses the hill climbing search method for decoding polar codes.
Lastly, the best-first search method reduces the complexity of beam search by tracing the best L nodes encountered so far, which can produce a partially developed decoding tree. The best-first search method is similar to hill climbing in that it traces the best path. The main difference is that best-first search allows the decoder to change its previous decisions according to the current likelihood information. The stack SC algorithm [5] uses the best-first search method to decode polar codes.
A decoding tree example, which consists of all of the mentioned search meth-
ods, is shown in Figure 2.4. In this example, the black paths represent the visited
paths and the gray paths represent the ignored decoding paths. All decoding paths are possible, because all decisions are set as free. The likelihoods of the decoding paths are written inside the circles, and the values on the arrows are the hard decisions at each decision step. Using the British museum search method (Fig. 2.4a), all paths are visited; therefore, the hard decision at the end of decoding is the most probable path, which is 001 with probability 0.33. On the other hand, the hill climbing search (Fig. 2.4b) ends up with a different hard decision, 101, which has a lower likelihood, 0.19. The reason lies in the first decoding step: the SC decoder selects the locally best path, but the other branch turns out to lead to a better path at the end of the binary decision tree. Unlike the British museum search, beam search traces only the two most probable paths instead of eight and still finds the most probable path (Fig. 2.4c). Lastly, the best-first search starts with the paths that have probabilities 0.60 and 0.40, then explores the path with probability 0.32. Since this is smaller than the best entry remaining in the stack, the algorithm explores the other nodes, which have probabilities 0.28 and 0.35, respectively. At the end of the best-first search, the algorithm reveals the most probable decoding path, which has probability 0.33.
ΥSCD (N ) = O(N log N ), (2.14)
ζSCD (N ) = O(N ). (2.15)
7 ûi ←− 0
8 else
9 ûi ←− 1
10 return ûA
More specifically, the channel LLR values γ are calculated for the BAWGNC as

γ = ln [ W(y|x = 0) / W(y|x = 1) ]                                          (2.16)
  = ln [ e^{−(y−1)^2/2σ^2} / √(2πσ^2) ] − ln [ e^{−(y+1)^2/2σ^2} / √(2πσ^2) ]   (2.17)
  = −(y−1)^2/2σ^2 + (y+1)^2/2σ^2                                             (2.18)
  = 2y/σ^2,                                                                  (2.19)
where BPSK modulation with the standard mapping is used to assign an encoder output bit x_i to a transmitted symbol s_i. For 1 ≤ i ≤ N, the mapping rule is

s_i = { +1, if x_i = 0
      { −1, if x_i = 1.    (2.20)
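The mapping rule (2.20) and the LLR formula (2.19) together form the decoder front end. A small sketch (the AWGN noise generation is added for completeness; function names are illustrative):

```python
import random

def bpsk_modulate(x):
    """Standard mapping (Eq. 2.20): bit 0 -> +1, bit 1 -> -1."""
    return [1.0 if xi == 0 else -1.0 for xi in x]

def channel_llr(y, sigma):
    """Channel LLRs for the BAWGNC (Eq. 2.19): gamma = 2y / sigma^2."""
    return [2.0 * yi / (sigma * sigma) for yi in y]

def transmit(x, sigma, rng=random):
    """Send codeword x over the BAWGNC and return the received LLRs."""
    s = bpsk_modulate(x)
    y = [si + rng.gauss(0.0, sigma) for si in s]
    return channel_llr(y, sigma)
```

A positive LLR means the received symbol favors the decision x_i = 0, consistent with the sign convention used throughout this chapter.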
Three different functions, called f, g and d, are defined to describe the behavior of the SC decoder. Firstly, the f function is responsible for the calculation of the top channel splitting operation defined in Section 2.3.2. The f function, with likelihood ratio (LR) representation, is
where (ii): both the numerator and the denominator are divided by W(y_1|û_a = 1)W(y_2|û_b = 1).
The f function, with LLR representation, is

f(γ_a, γ_b) = γ_c                                        (2.25)
            = 2 tanh^{-1}( tanh(γ_a/2) tanh(γ_b/2) )     (2.26)
            ≈ sign(γ_a γ_b) min(|γ_a|, |γ_b|) [11], [12], (2.27)
where the min-sum approximation was originally defined for BP decoding of LDPC codes [11] and is used in SC decoding of polar codes for the first time in [12]. The min-sum approximation causes an insignificant performance degradation, as will be shown in Section 2.6.2.
= δ_a (1 − 2û_c) + δ_b.   (2.31)
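The min-sum f function (2.27) and the g function (2.31) are the two soft-decision updates of the SC decoder. A minimal sketch of both:

```python
def f_minsum(ga, gb):
    """Min-sum approximation of the f function (Eq. 2.27):
    sign(ga * gb) * min(|ga|, |gb|)."""
    sign = -1.0 if ga * gb < 0 else 1.0
    return sign * min(abs(ga), abs(gb))

def g_func(ga, gb, u_hat):
    """g function in the LLR domain (Eq. 2.31): the partial-sum bit
    u_hat flips the sign of the first input before adding."""
    return (1 - 2 * u_hat) * ga + gb
```

In hardware, f_minsum reduces to a comparator and a sign XOR, which is why the min-sum form is preferred for the FPGA implementation.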
Lastly, the d function is the decision function, which computes hard decisions from soft decisions such that

û_i = { u_i, if i ∉ A
      { 0,   if i ∈ A and W(y, û_1^{i−1} | û_i = 0) / W(y, û_1^{i−1} | û_i = 1) ≥ 1    (2.33)
      { 1,   otherwise.
belongs to the frozen set A^c, the ith hard decisions of all L lists are updated with the frozen decision u_i. In the case of a free decision, the decoder checks whether the current list size equals the maximum list size. If not, the current list size doubles and the decoder can track the likelihoods of both decisions. When all lists are occupied, the decoder sorts the 2L likelihoods to continue with the best L decoding paths. At the end of the last decision step, the decoder outputs the free bits of the best list as û_A.
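The list-doubling and pruning step described above can be sketched as follows. Each path is a (metric, bits) pair; for illustration we assume a larger metric means a more probable path, and `branch_metric` is a hypothetical stand-in for the likelihood update:

```python
def extend_and_prune(paths, branch_metric, L):
    """One free-bit step of SCL decoding: extend every surviving path
    with both decisions, then keep only the L most probable paths."""
    extended = []
    for metric, bits in paths:
        for bit in (0, 1):
            # branch_metric returns the path metric after appending `bit`
            extended.append((branch_metric(metric, bits, bit), bits + [bit]))
    # sort the (up to) 2L candidates and keep the best L;
    # until the list is full this simply doubles the list size
    extended.sort(key=lambda p: p[0], reverse=True)
    return extended[:L]
```

The hardware analogue of the sort is the bitonic sorter (BS) listed in the abbreviations, which sorts the 2L metrics in parallel.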
16 return ûA
2.5.3 Adaptive Successive Cancellation List Decoding of
Polar Codes
The adaptive SCL algorithm consists of the SC, SCL and CRC decoding algorithms. The aim of the algorithm is to increase the throughput of the SCL decoder [13], [14]. A high-level description of the algorithm is shown in Algorithm 3. The inputs of the adaptive SCL decoding algorithm are the received codeword y_1^N, the code block length N, the information set A, the frozen bit vector u_{A^c} and the list size L. The output of the algorithm is the free bit vector û_A. At the beginning of the algorithm, the SC decoder calculates a free bit candidate vector. If the CRC of that vector is valid, the algorithm terminates with the output of the SC decoder. In the case of an invalid CRC, the algorithm calls the SCL algorithm with list size L, which calculates L hard decision candidate vectors. If one of them has a valid CRC, the algorithm terminates with that output. If none of them has a valid CRC, the algorithm terminates with the most probable hard decision candidate vector.
In this section, we present software simulations of the SC, SCL and adaptive SCL decoding algorithms for N = 1024 and K = 512. Although fixed-point data types and function approximations reduce the complexity of the implementation, they cause some performance degradation. This degradation should remain insignificant so that the trade-off between performance and complexity is preserved. Accordingly, we perform fixed-point and approximation simulations to illustrate the performance loss in the FPGA implementation. For all simulations, the code is optimized for 0 dB by using the Monte-Carlo code construction method with 10,000,000 trials [2]. The channel model is the BAWGNC for all simulations.
Algorithm 3: Adaptive Successive Cancellation List Decoding Algorithm
Input: received codeword, y1N
Input: code block length, N
Input: information set, A
Input: frozen bit vector, uAc
Input: maximum list size, L
Output: estimated information bits, ûA
Variable: j //valid CRC vector of SCD
Variable: k //valid CRC vector of SCLD
1 begin
2 ûA ←− Successive Cancellation Decoding (y1N , N , A, uAc )
3 j ←− Cyclic Redundancy Check Decoding (ûA )
4 if j is true then
5 return ûA
6 else
7 ûL,A ←− Successive Cancellation List Decoding (y1N , N , A,
uAc , L)
8 for l ← 1 to L do
9 k ←− Cyclic Redundancy Check Decoding (ûl,A )
10 if k is true then
11 ûA ←− ûl,A
12 return ûA
13 ûA ←− û1,A
14 return ûA
2.6.1 Comparison between Floating-point and Fixed-point Simulations of the SC Decoder
In this section, we use P-bit precision for both the channel input and the internal LLR values of the SC decoder; P = 32 corresponds to the floating-point simulations. The BER and FER performance results are shown in Figures 2.5 and 2.6, respectively. The performance difference between 32-bit floating-point precision and 6-bit or 5-bit fixed-point precision is insignificant. When 4-bit LLR precision is used, a noticeable performance degradation of up to 1 dB occurs. This degradation becomes more significant as the energy per bit to noise power spectral density ratio (Eb/No) increases.
Figure 2.5: BER performance of the SC decoder for different bit precisions (P), N = 1024, K = 512.
Figure 2.6: FER performance of the SC decoder for different bit precisions (P), N = 1024, K = 512.
The BER and FER performance losses due to the min-sum approximation are shown in Figures 2.7 and 2.8, respectively. The results indicate that the performance loss due to the min-sum approximation of the f function in the SC decoder is insignificant.
The fixed-point simulations of the SCL decoder with soft decision precision P = 6 are shown in Figures 2.9 and 2.10. Note that the SCL decoder does not use a CRC in this simulation. Beyond 3 dB Eb/No, the performance improvement from L = 2 to L = 32 is not observable; however, there is still a performance gap between the SC and SCL decoders.
[Figure 2.7: BER performance of the SC decoder with the min-sum approximation vs. exact calculation.]

[Figure 2.8: FER performance of the SC decoder with the min-sum approximation vs. exact calculation.]
Figure 2.9: BER performance of the SC and the SCL decoders, N = 1024, K = 512, P = 6.
Figure 2.10: FER performance of the SC and the SCL decoders, N = 1024, K = 512, P = 6.
2.6.4 Fixed-point Simulations of the Adaptive SCL Decoder
For the adaptive SCL decoder, we ran simulations to determine the input precision Pi of the likelihood values. The BER and FER simulation results are shown in Figures 2.11 and 2.12, respectively. There is a significant performance loss when the input of the adaptive SCL decoder has Pi = 3 bits. When Pi = 4, there is up to 0.5 dB performance loss due to inadequate input precision. For the other input bit precisions (Pi = 5 and Pi = 6), we observed an insignificant performance loss.
[Figure 2.11: BER performance of the adaptive SCL decoder for input precisions Pi = 3, 4, 5, 6.]
[Figure 2.12: FER performance of the adaptive SCL decoder for input precisions Pi = 3, 4, 5, 6.]
non-systematic polar codes under SC decoding. The FER performance results are shown in Figure 2.14. We observed that the FER performances of systematic and non-systematic polar codes under SC decoding are almost identical.
For decoding of polar codes, we presented the SC, the SCL and the adap-
tive SCL algorithms. We showed some fixed-point software simulation results to
[Figure 2.13: BER performance of systematic and non-systematic polar codes.]
[Figure 2.14: FER performance of systematic and non-systematic polar codes under SC decoding.]
demonstrate the performance loss due to the approximations and the input bit precisions of the SC and adaptive SCL decoders. As a result, we observed an insignificant performance loss due to the approximations when the input LLR bit precision is more than 5 bits. Likewise, for the adaptive SCL decoder, an input log-likelihood (LL) bit precision of more than 5 bits does not cause an observable performance loss. Based on these results, we will use systematic coding with P = 6 input LLR and LL bit precisions for our adaptive SCL decoder FPGA implementation, which we present in the next chapter.
Chapter 3
3.1.1 Successive Cancellation Decoder Algorithms and
Implementations
type, i is the stage number and j is the element number within a stage. In general, each stage requires N computation blocks to calculate N hard decisions, which takes 2N − 2 clock cycles (CCs). This is the conventional decoding cycle (DC) of the SC algorithm. Using N log N PEs is a naive way to maximize computation speed; however, the data dependencies caused by the successive nature of the decoder allow using fewer PEs without spending extra CCs. Since at most N/2 PEs are active in any one CC, it is unnecessary to implement N PEs for each stage. At this point, the pipelined tree architecture emerges.
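The conventional decoding cycle count grows linearly in the block length. A quick sanity check of the 2N − 2 figure (an illustrative helper, not part of the decoder):

```python
def sc_decoding_cycles(n):
    """Conventional SC decoding cycle count: each of the 2N - 2 stage
    activations in the schedule costs one clock cycle for a length-N code."""
    assert n > 1 and (n & (n - 1)) == 0, "N must be a power of two"
    return 2 * n - 2
```

For N = 1024 this gives 2046 CCs, which is the baseline DC that the semi-parallel and two-phase architectures below trade against PE count.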
Figure 3.1: Data flow graph of forward processing for successive cancellation decoder,
N = 8.
The utilization of the PEs increases further at the expense of extra CCs, reducing complexity by using fewer PEs. The semi-parallel architecture [17] uses v PEs with 1 < v < N/2. The stages from 1 to log N − log v − 1 need more than v PEs; stage log N − log v needs exactly v PEs, and the remaining stages need fewer than v PEs to calculate their internal soft decisions in one CC. Therefore, more than one CC is spent during the first log N − log v − 1 stages, with full utilization of the v PEs. Since these stages are activated less frequently than the other stages, the time expense is tolerable: an extra (N/v) log(N/(4v)) + 2 CCs are necessary. Thus, the DC increases.
Another approach to reduce complexity is the two-phase decoder [18], which implements √N − 1 PEs, like a length-√N decoder with the tree architecture. In the two-phase SC decoder architecture, decoding is divided into two phases: phase-1 for the first (log N)/2 stages and phase-2 for the remaining stages. Throughout the decoding of a code block, the phase-1 and phase-2 stages are activated √N times; therefore, √N hard decisions emerge at the end of each phase-2 activation. In total, an additional N + √N (log N)/2 CCs are necessary to calculate all hard decisions without using the 2-bit decoding method, which will be explained in the following section.
decode. At this point, the fast-SC (FSC) algorithm emerges [25]. In this algorithm, rate-0, rate-1, single parity check (SPC) and repetition (REP) special code segments are detected and then decoded directly by an ML decoder. The decomposition of code segments and the detection of special code segments are shown in Figure 3.2.
Figure 3.2: Decomposition of code segments and detection of special code segments,
N = 8.
[29], [13]. Initial implementations of the SCL decoder focus on maximizing the
throughput with a fast radix sorting algorithm and a limited list size in [30], [31]
and [32]. These implementations use the log-likelihood (LL) representation of channel
information to calculate hard decisions. A log-likelihood ratio (LLR) based list
decoding implementation is presented in [33] to reduce the complexity. In addition,
the conventional latency of an SCL decoder can be reduced by the reduced latency
list decoding (RLLD) algorithm of [34], which enables the detection of rate-0 and
rate-1 constituent codes. An adaptive list decoding algorithm in software is
presented in [13] to reduce the latency. In this algorithm, the decoder starts with
list-1 decoding. If the CRC is valid as a result of list-i decoding, the algorithm
terminates with the output of the list-i decoder; otherwise the decoder
is relaunched with list size 2i, until the maximum list size is reached.
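The adaptive control loop of [13] can be sketched as follows; `scl_decode` and `crc_ok` are placeholder hooks, not the actual decoder interfaces:

```python
def adaptive_decode(llr, crc_ok, scl_decode, l_max=32):
    """Adaptive SCL control loop (as in [13]): start with list size 1 and
    double the list size after every CRC failure, up to l_max.
    scl_decode(llr, l) and crc_ok(word) are placeholder hooks for the
    SCL decoder core and the CRC check."""
    l = 1
    while True:
        word = scl_decode(llr, l)
        if crc_ok(word) or l >= l_max:
            return word, l
        l *= 2

# Toy demonstration: pretend the CRC only passes once the list size is 4.
tried = []
word, l = adaptive_decode(None,
                          crc_ok=lambda w: w >= 4,
                          scl_decode=lambda llr, l: tried.append(l) or l)
print(tried, l)   # [1, 2, 4] 4
```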
a length-λi (2 ≤ λi < N, 2 ≤ i < N/2) constituent of an internal hard decision
vector, û. This constituent hard decision vector is used by the PSU to calculate
partial sums. At the end of the partial sum calculation, the SC decoder completes
one iteration as a part of û is revealed. Until all free bits ûA are revealed, the SC
decoder starts its next iteration and the output of the PSU is fed back to the PU.
Lastly, the CL is responsible for all control signals that activate each module and
regulate the scheduling. Since the SC decoder uses systematic coding (Section 2.4),
the output of the SC decoder consists of the systematic information bits, x̂A. Note
that we set all frozen bits to zero, uAc = 0, to obtain a decision rule.
To enhance throughput, the SC decoder stores LLR and hard decision informa-
tion in registers instead of block random access memory (BRAM), which provides
slower data access and less data width compared to register memory. In this way,
the processors can access data faster and with higher parallelization. State
information, free set information and scheduling control information are stored in
BRAM, because control sequences do not need high parallelization. In the following
sections, we will present the details of our PU, DU, PSU and CL implementations.
The PU consists of PEs, which implement the f (2.25) and g (2.28) functions. The
aim of the PU is to complete at most N log N computations as fast as possible
with the limited resources. The PU employs PEs in the pipeline tree architecture
[16], without the PE and the decision element in the last stage of the architecture,
in order to provide a minimum 2-bit hard decision in a DC. Therefore, the PU has
log N − 1 stages and N − 2 PEs. Although the conventional DC of SC takes
2N − 2 CCs, our implementation takes a variable number of CCs depending on the
free set A and the frozen set Ac. We analyzed A and all possible subsets of A to
detect rate-0, rate-1, REP and SPC constituent code segments. When these code
segments emerge, the PU terminates and gives the internal LLRs to the DU.
Table 3.1: The truth table of a PE.

CLK | EN | SEL | û | γc
 -  | x  | x   | x | no change
 ↑  | 0  | x   | x | no change
 ↑  | 1  | 0   | x | sign(γa · γb) · min(|γa|, |γb|)
 ↑  | 1  | 1   | 0 | γb + γa
 ↑  | 1  | 1   | 1 | γb − γa
terms of the g function, using TC logic minimizes resource usage due to the addition
and subtraction operations. At the end of a PE, pipeline registers are used for γc
to meet the timing requirements of the FPGA. The implementation results of a
PE are shown in Table 3.2. In these results, the input values are also registered to
measure the latency of the critical path from an input to an output. As a
result, we choose the SM representation to implement the SC decoder.
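The behavior in Table 3.1 can be modeled in software as follows; this is a behavioral sketch on ordinary signed numbers, not the SM/TC hardware representations:

```python
def pe(en, sel, u, ga, gb, gc_prev):
    """One PE output per Table 3.1 (behavioral model).
    en=0 keeps the previous value; sel=0 computes the f (min-sum)
    function; sel=1 computes the g function using the partial sum u."""
    if not en:
        return gc_prev
    if sel == 0:                            # f: sign(ga*gb) * min(|ga|,|gb|)
        s = 1 if ga * gb >= 0 else -1
        return s * min(abs(ga), abs(gb))
    return gb + ga if u == 0 else gb - ga   # g: gb ± ga depending on u

print(pe(1, 0, 0, -2, 3, 0))   # f(-2, 3)  -> -2
print(pe(1, 1, 1, -2, 3, 0))   # g with u=1 -> 5
```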
The DU creates internal hard decisions from the γ soft decisions for a constituent
code of length λi. The CL activates the DU at the end of the PU, when a REP, an
SPC, a rate-0 or a rate-1 code segment emerges. We use a decision rule, similar to
[25], to decode the constituent codes; the decision rule is given in Equation 3.1.
In this decision rule, rλ represents the code rate of the length-λ constituent code,
d̂λ₁,ᵢ is the ith bit of the length-λ internal hard decision vector, sign(γ) denotes
the hard decision on one LLR (0 if γ ≥ 0, 1 otherwise), and the sums over sign(γj)
are modulo-2. Note that this decision rule is only valid when all frozen bits are
zero: uAc = 0.
$$
\hat{d}^{\,\lambda}_{1,i} =
\begin{cases}
0, & \text{if } r_\lambda = 0,\\[3pt]
0, & \text{if } r_\lambda = \tfrac{1}{\lambda} \text{ and } \sum_{j=1}^{\lambda} \gamma_j \ge 0,\\[3pt]
1, & \text{if } r_\lambda = \tfrac{1}{\lambda} \text{ and } \sum_{j=1}^{\lambda} \gamma_j < 0,\\[3pt]
\operatorname{sign}(\gamma_i), & \text{if } r_\lambda = \tfrac{\lambda-1}{\lambda} \text{ and } \bigoplus_{j=1}^{\lambda} \operatorname{sign}(\gamma_j) = 0,\\[3pt]
\overline{\operatorname{sign}(\gamma_i)}, & \text{if } r_\lambda = \tfrac{\lambda-1}{\lambda} \text{ and } \bigoplus_{j=1}^{\lambda} \operatorname{sign}(\gamma_j) \neq 0 \text{ and } i = \arg\min_k |\gamma_k|,\\[3pt]
\operatorname{sign}(\gamma_i), & \text{if } r_\lambda = \tfrac{\lambda-1}{\lambda} \text{ and } \bigoplus_{j=1}^{\lambda} \operatorname{sign}(\gamma_j) \neq 0 \text{ and } i \neq \arg\min_k |\gamma_k|,\\[3pt]
\operatorname{sign}(\gamma_i), & \text{otherwise } (r_\lambda = 1).
\end{cases}
\tag{3.1}
$$
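A software rendering of this decision rule, mapping γ ≥ 0 to bit 0 and flipping the least reliable bit when the SPC parity fails; the function and the string labels for the rate cases are our own sketch:

```python
def decide(gammas, rate_type):
    """ML decisions for a length-λ constituent code from its LLRs γ,
    following Equation 3.1 (γ >= 0 maps to bit 0)."""
    lam = len(gammas)
    bit = lambda g: 0 if g >= 0 else 1            # hard decision on one LLR
    if rate_type == "rate-0":
        return [0] * lam                          # all-frozen: output zeros
    if rate_type == "REP":
        return [bit(sum(gammas))] * lam           # repeat the sign of the sum
    hard = [bit(g) for g in gammas]
    if rate_type == "SPC" and sum(hard) % 2 != 0:
        i = min(range(lam), key=lambda k: abs(gammas[k]))
        hard[i] ^= 1                              # flip the least reliable bit
    return hard                                   # SPC (parity ok) or rate-1

print(decide([0.5, -1.2, 0.3, 0.1], "SPC"))   # parity fails -> [0, 1, 0, 1]
```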
Table 3.3: Implementation results of REP and SPC constituent codes, P = 6.
The PSU is responsible for both the calculation of the feedback hard decisions and
the final output of the SC decoder, x̂A. This is achieved by encoding the length-λi
internal decision vector blocks, d̂, such that λi is the length of the ith constituent
code. The encoding operation has log N stages and it is the bit-reversed version
of the encoding algorithm that we present in Section 2.4. The last stage decision
vector, d̂₁,log N, is the systematic output of the SC decoder. In general, the
jth stage has N/2^j decision vectors d̂ᵢ,ⱼ of length 2^j. The total number of d̂
blocks loaded to the PSU is defined as υ, such that $\sum_{i=1}^{\upsilon} \lambda_i = N$.
The PSU performs the logic operations with combinational logic and uses register
arrays to keep the d̂ values. For instance, let N = 8, υ = 3, λ1 = 2, λ2 = 2 and
λ3 = 4; the data flow of the PSU is shown in Figure 3.5. Since λ1 = 2, the
first input of the PSU is d̂₁,₁ = (û1 ⊕ û2, û2). The PSU saves this input vector
in a register array of length 2 and feeds these hard decisions back to the PU. The
second decision block, of length λ2, is d̂₂,₁ = (û3 ⊕ û4, û4). This vector and the
registered vector are combined as d̂₁,₂ = (û1 ⊕ û2 ⊕ û3 ⊕ û4, û3 ⊕ û4, û2 ⊕ û4, û4).
Similar to the previous encoded hard decision block, this hard decision block is
also given to the PU for the forward processing operations. Since λ3 = 4, there
is no need to calculate d̂₃,₁ and d̂₄,₁. The last decision block is d̂₂,₂ = (û5 ⊕ û6 ⊕
û7 ⊕ û8, û7 ⊕ û8, û6 ⊕ û8, û8) in the second stage; the PSU combines d̂₁,₂ and
d̂₂,₂ as d̂₁,₃ and the decoding is completed. Therefore, the general encoding rule
of the PSU combines two stage-j blocks into one stage-(j+1) block by interleaving
element-wise XORs with the second block: d̂₁,ⱼ₊₁,₂ₖ₋₁ = d̂₁,ⱼ,ₖ ⊕ d̂₂,ⱼ,ₖ and
d̂₁,ⱼ₊₁,₂ₖ = d̂₂,ⱼ,ₖ for k = 1, …, 2^j.
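The stage-by-stage combining can be checked with a short script; the concrete bit values for û are hypothetical, chosen only to exercise the N = 8 example above:

```python
def combine(left, right):
    """Combine two stage-j decision vectors into one stage-(j+1) vector by
    interleaving (left[k] XOR right[k], right[k]) pairs, the bit-reversed
    polar transform used by the PSU."""
    out = []
    for a, b in zip(left, right):
        out += [a ^ b, b]
    return out

# Hypothetical hard decisions u1..u8 for the N = 8 example in the text.
u = [1, 0, 1, 1, 0, 1, 0, 1]
d11 = combine(u[0:1], u[1:2])        # (u1^u2, u2)
d21 = combine(u[2:3], u[3:4])        # (u3^u4, u4)
d12 = combine(d11, d21)              # length-4 block fed back to the PU
d22 = combine(combine(u[4:5], u[5:6]), combine(u[6:7], u[7:8]))
d13 = combine(d12, d22)              # final systematic output block
print(d12, d13)
```

Expanding `d12` symbolically reproduces (û1 ⊕ û2 ⊕ û3 ⊕ û4, û3 ⊕ û4, û2 ⊕ û4, û4), i.e. exactly the d̂₁,₂ given in the text.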
The CL is responsible for the offline detection of the REP, SPC, rate-0 and rate-1
constituent codes and for the scheduling of the SC decoder. The code detection
functions are implemented in the very high speed integrated circuit hardware
description language (VHDL) to make a compact system design. As we present in
Section 3.1.1, the conventional DC of the SC decoder is 2N − 2 CCs. Due to the
detection of special constituent codes, the CL reduces the number of decoding
stages as shown in Figure 3.4. In this figure, K is the number of free bits and λM
is the maximum length of the REP and SPC constituent codes. The current stage
number i of the f and g functions is written as fi and gi, and a constituent code of
length j is denoted repj for repetition, spcj for single parity check, aij for rate-1
and afj for rate-0 codes. When K = 4, the free set is A4 = {4, 6, 7, 8}. Similarly,
the other free sets are A3 = {4, 6, 8} when K = 3 and A7 = {2, 3, 4, 5, 6, 7, 8} when
K = 7. If (K, λM) is (4, 0), the CL does not detect the constituent codes, so the
conventional DC does not change. In the other cases, the CL is allowed to detect
constituent codes and the overall DC reduces. The CL sets the DC such that each
stage of the PU takes one CC, each iteration of the DU takes one CC, and the PSU
operates with combinational logic, which does not contribute to the DC.
(K, λM) | CC: 1    2    3    4    5    6    7    8    9    10   11   12   13   14
(4, 0)  |     f1   f2   f3   g3   g2   f3   g3   g1   f2   f3   g3   g2   f3   g3
(4, 2)  |     f1   f2   af2  g2   rep2 g1   f2   spc2 g2   ai2  -    -    -    -
(4, 4)  |     f1   rep4 g1   spc4 -    -    -    -    -    -    -    -    -    -
(3, 2)  |     f1   af4  g1   f2   rep2 g2   ai2  -    -    -    -    -    -    -
(7, 2)  |     f1   f2   rep2 g2   ai2  g1   ai4  -    -    -    -    -    -    -
$$
\delta_0 = -\ln W(y \mid x = 0) = -\ln \frac{e^{-(y-1)^2/2\sigma^2}}{\sqrt{2\pi\sigma^2}},
\qquad
\delta_1 = -\ln W(y \mid x = 1) = -\ln \frac{e^{-(y+1)^2/2\sigma^2}}{\sqrt{2\pi\sigma^2}}.
\tag{3.4}
$$
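A minimal sketch of (3.4), computed before quantization to P bits; the BPSK mapping 0 → +1, 1 → −1 is assumed from the (y ∓ 1)² terms:

```python
import math

def channel_lls(y: float, sigma: float):
    """Negative log-likelihoods (3.4) of a BPSK symbol (0 -> +1, 1 -> -1)
    observed as y over an AWGN channel with noise variance sigma^2."""
    norm = math.log(math.sqrt(2 * math.pi) * sigma)
    d0 = (y - 1) ** 2 / (2 * sigma ** 2) + norm   # -ln W(y | x = 0)
    d1 = (y + 1) ** 2 / (2 * sigma ** 2) + norm   # -ln W(y | x = 1)
    return d0, d1

d0, d1 = channel_lls(0.8, 1.0)
print(d0 < d1)   # a positive observation favors x = 0 (smaller -ln W)
```

Note that the difference d1 − d0 equals the usual channel LLR 2y/σ², so the LL pair carries the same information as an LLR plus a common offset.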
Both δ0 and δ1 can take values between 0 and 2^P − 1, represented with P-bit
precision. The data flow of the SCL decoder is shown in Figure 3.6. The main
modules of the SCL decoder are the list processing unit (LPU), the list partial sum
update (LPSU) and the bitonic sorter (BS). We use an asymmetric BRAM to save
δ0 and δ1 from the channel at the beginning of decoding. After that, the processors
use this BRAM to read and write the δ0 and δ1 values for the internal soft decision
calculations. There are V LPUs in our SCL decoder implementation, and each LPU
has one soft decision router element (SRE) and L list processing elements (LPEs).
The aim of the SRE is to route the output of the LL asymmetric BRAM to the input
of the LPEs with respect to the pointer information, which is stored in a pointer
register array memory. Each LPU takes 4LP bits as input and calculates 2LP bits
of LL information in a pipelined manner. For the last log V stages, the output of
the ith LPU, for i = 1, …, V, feeds back directly to the input of the LPU with index
⌈i/2⌉, without accessing the BRAM, to enable pipelined calculations. A data buffer
is implemented for the feedback operation of the LPUs.
For each valid decoding path, the SCL decoder calculates path likelihood informa-
tion such that the number of path likelihoods doubles at each free decision step,
until 2L different valid paths emerge. At this point, the decoder does not have
adequate resources to track all 2L paths. Therefore, it sorts them to find the L
best paths. If more than one winner path is reproduced from the same ancestor,
two different processors have to access the same memory and one of these
processors writes to that memory. This causes memory conflicts due to overwriting.
To solve this problem, the primitive idea is to copy all the data history from the
memory of a winner path to the memory of a loser path. The cost of copying this
information is quite significant, because the amount of data to be copied is
2LP(N − 1) bits for the likelihood values and LN bits for the hard decisions.
For the implementation of the SCL decoder, we use a P-bit LL representation to
calculate the internal soft decisions. Therefore, 2P bits are necessary to keep both
likelihood values: δ0 for the likelihood of 0 and δ1 for the likelihood of 1. In
addition, we use N − 1 memory locations to save the 2P likelihood values for each
list to enable the internal calculations. During state copying, the SCL decoder
pauses all processing operations, which creates additional latency after each
sorting operation. A better solution is to create a pointer memory, which
remembers the valid memory locations for all lists. We use this approach and
create an L log N × log L bit pointer array memory. The pointer memory has depth
L log N, because each list needs a pointer for log N stages. There are L different
memory options to access, thus the data width of the pointer memory is log L.
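The pointer idea can be sketched as follows. This is a simplified software model (the class and method names are ours); in the hardware, a loser list additionally keeps writing its new values to its own bank and updates the corresponding pointer entry:

```python
class PointerMemory:
    """Per-list, per-stage pointers into the L physical LL memories.
    Instead of copying a winner path's whole LL history (2LP(N-1) bits)
    into a loser's memory, only L*logN pointer entries of logL bits each
    are rewritten, as described in the text."""
    def __init__(self, num_lists: int, num_stages: int):
        # pointer[l][s]: which physical bank list l reads at stage s
        self.pointer = [[l for _ in range(num_stages)]
                        for l in range(num_lists)]

    def clone(self, winner: int, loser: int) -> None:
        # After sorting, the loser path reads the winner's history
        # at every stage; no LL data is moved.
        self.pointer[loser] = list(self.pointer[winner])

    def bank(self, lst: int, stage: int) -> int:
        return self.pointer[lst][stage]

pm = PointerMemory(num_lists=4, num_stages=10)
pm.clone(winner=0, loser=3)
print(pm.bank(3, 5))   # list 3 now points at list 0's memory -> 0
```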
The LPU consists of L LPEs and one SRE to compute the δ0 and δ1 soft decisions.
We use the semi-parallel architecture to minimize the total complexity of all LPU
blocks. Therefore, the SCL decoder has V LPUs. Each LPU reads from and writes
to the asymmetric BRAM, which has a 2V × L × P bit input data width, a
4V × L × P bit output data width and a (2N/V) − 2 address length. The LL bit
precision P is taken as P = log N + Q, where Q is the channel LL precision, to
avoid overflows during the forward processing of the log N stages. An LPE can
read one of L different memory locations of the BRAM or the data buffer between
the LPUs; however, it can write only to its own memory location. During the
activation of the LPSU and the BS, all LPUs are paused. In case of a loser list
after sorting, the SRE does not route the LL information of that list to the LPEs
to perform the g functions. The SRE reads the pointer values from the pointer
register array and routes both the output of the BRAM and the data buffer to the
input of the LPEs. The pointer information is only necessary for the g functions,
because the f functions always operate on the same memory. The synthesis results
of the LPU, SRE and LPE are shown in Table 3.5. As L increases, the SRE uses
more lookup tables (LUTs) and dominates the LPU implementation.
Figure 3.7: The RTL schematic of list processing unit for N = 1024, P = 16 and
L = 4.
$$
\begin{aligned}
f_0(\delta_{a,0}, \delta_{a,1}, \delta_{b,0}, \delta_{b,1}) = \delta_{c,0}
&= \min(\delta_{a,0} + \delta_{b,0},\ \delta_{a,1} + \delta_{b,1})
   + \ln\!\left(1 + e^{-|\delta_{a,0}+\delta_{b,0}-\delta_{a,1}-\delta_{b,1}|}\right) - \ln 2\\
&\approx \min(\delta_{a,0} + \delta_{b,0},\ \delta_{a,1} + \delta_{b,1}),
\end{aligned}
\tag{3.5}
$$

$$
\begin{aligned}
g_0(\delta^{*}_{a,0}, \delta^{*}_{a,1}, \delta^{*}_{b,0}, \delta^{*}_{b,1}, \hat{u}) = \delta_{d,0}
&= \delta^{*}_{a,0} + \delta^{*}_{b,0} + (\delta^{*}_{a,1} - \delta^{*}_{a,0})\,\hat{u} - \ln 2\\
&\approx \delta^{*}_{a,0} + \delta^{*}_{b,0} + (\delta^{*}_{a,1} - \delta^{*}_{a,0})\,\hat{u},
\end{aligned}
\tag{3.7}
$$

$$
\begin{aligned}
g_1(\delta^{*}_{a,0}, \delta^{*}_{a,1}, \delta^{*}_{b,0}, \delta^{*}_{b,1}, \hat{u}) = \delta_{d,1}
&= \delta^{*}_{a,1} + \delta^{*}_{b,1} + (\delta^{*}_{a,0} - \delta^{*}_{a,1})\,\hat{u} - \ln 2\\
&\approx \delta^{*}_{a,1} + \delta^{*}_{b,1} + (\delta^{*}_{a,0} - \delta^{*}_{a,1})\,\hat{u},
\end{aligned}
\tag{3.8}
$$

where δ*ₐ,₀, δ*ₐ,₁, δ*ᵦ,₀, δ*ᵦ,₁ are the pointed LL vectors.
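These min-sum LL updates are easy to model in software. The text shows f0, g0 and g1 only; f1 below is our assumed symmetric counterpart of f0:

```python
def f0(da0, da1, db0, db1):
    # Min-sum approximation of (3.5): LL that the combined bit is 0.
    return min(da0 + db0, da1 + db1)

def f1(da0, da1, db0, db1):
    # Symmetric counterpart for the LL of 1 (assumed form; the text
    # only shows f0 explicitly).
    return min(da0 + db1, da1 + db0)

def g0(da0, da1, db0, db1, u):
    # Approximation of (3.7): the partial sum u selects the a-branch LL.
    return da0 + db0 + (da1 - da0) * u

def g1(da0, da1, db0, db1, u):
    # Approximation of (3.8).
    return da1 + db1 + (da0 - da1) * u

print(f0(1, 4, 2, 3), g0(1, 4, 2, 3, 1))   # 3 6
```

Note that for û = 1, g0 reduces to δa,1 + δb,0: the partial sum flips which a-branch hypothesis is paired with the b-branch likelihood of 0.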
The SCL decoder performs the g0 and g1 functions after the û hard decisions are
made. The SRE reads the pointer information from the pointer register array.
With respect to this information, the SRE routes the output of the asymmetric LL
BRAM to the L LPEs. This routing operation has PL² complexity, because an SRE
needs 4PL L-to-1 demultiplexers to map the {δa,0, δa,1, δb,0, δb,1} LLs to the
pointed LLs {δ*ₐ,₀, δ*ₐ,₁, δ*ᵦ,₀, δ*ᵦ,₁} for all lists.
3.3.2 List Partial Sum Update Logic (LPSU)
The LPSU updates the hard decision partial sums for all lists, as a feedback to the
g0 and g1 functions, and creates the systematic data output of the SCL decoder.
The feedback decisions have L(N − 1) bits and the systematic output has LN bits.
The LPSU uses registers as memory instead of BRAM to decrease the latency.
This makes it possible to calculate all the necessary partial sums in one CC. The
LPSU accesses the registers through N − 1 hard decision router elements (HREs),
with respect to the pointer information, to avoid recalculating previous hard
decisions. The data flow of the LPSU for N = 4, L = 2 is shown in Figure 3.8. At
each decision step i, 1 ≤ i ≤ N, L hard decisions appear, one for each list l,
1 ≤ l ≤ L, as ûl,i. If the outputs of the lth list have lower probability than those
of the other lists, we do not need the decisions of this list anymore. In addition,
a list can operate with the previous hard decisions of another list. For that
operation, the LPSU accesses the decision memory by using de-multiplexers and
calculates the partial sums.
3.3.2.1 Hard Decision Router Element (HRE)
The operation of the HREs is similar to that of the SREs in Section 3.3.1.2, as
they access the same pointer memory at different times. The only difference is
that an HRE takes hard decisions as input. Therefore, the complexity of each HRE
decreases from PL² to L², such that each HRE has L L-to-1 demultiplexers.
Accessing the pointer memory does not cause memory conflicts, because the read
operations occur before the write operations.
3.3.3 Sorter
Our SCL decoder needs a sorter to find the best L decoding paths among 2L at
each free decision stage. That means we activate the sorter module K times for a
code block. During sorting, the decoder core waits for the best decoding paths
without performing any operations. Therefore, the latency of the sorter module has
an important influence on the total latency of the SCL decoder. The probability of
each decoding path is represented by two LLs, each with P-bit precision. Although
these 2L LLs are sorted to find the highest L LLs, we only need the L winner
indices and their list numbers. The remaining L loser paths are discarded. The
output information of the sorter is used for making hard decisions and updating
the pointer list values.
Let Υ_B(L) denote the time complexity of the bitonic sorter with list size L,
such that

$$
\begin{aligned}
\Upsilon_B(L) &= \Upsilon_B(L/2) + \log L + 1 && (3.9)\\
&= \Upsilon_B(L/2) + \Theta(\log L) && (3.10)\\
&= O(\log^2 L). && (3.11)
\end{aligned}
$$

Let ζ_B(L) denote the space complexity of the bitonic sorter with list size L,
such that

$$
\begin{aligned}
\zeta_B(L) &= 2\zeta_B(L/2) + 2L(\log L + 1) && (3.12)\\
&= 2\zeta_B(L/2) + \Theta(L \log L) && (3.13)\\
&= O(L \log^2 L). && (3.14)
\end{aligned}
$$
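A behavioral sketch of the recursive bitonic sorter (software form, not the pipelined comparator network), sorting 2L = 8 path metrics in descending order and keeping the best L = 4:

```python
def bitonic_sort(vals, descending=True):
    """Recursive bitonic sorter for a power-of-two input: sort the two
    halves in opposite directions, then bitonic-merge the result."""
    n = len(vals)
    if n == 1:
        return list(vals)
    half = n // 2
    top = bitonic_sort(vals[:half], descending=True)
    bot = bitonic_sort(vals[half:], descending=False)
    return bitonic_merge(top + bot, descending)

def bitonic_merge(vals, descending):
    """Merge a bitonic sequence; each for-loop pass over the halves is
    one comparator substage of the network."""
    n = len(vals)
    if n == 1:
        return list(vals)
    half = n // 2
    out = list(vals)
    for i in range(half):                      # one compare-exchange substage
        a, b = out[i], out[i + half]
        if (a < b) == descending:
            out[i], out[i + half] = b, a
    return (bitonic_merge(out[:half], descending)
            + bitonic_merge(out[half:], descending))

# 2L = 8 hypothetical path metrics for L = 4; keep the best L = 4.
lls = [3, 9, 1, 7, 5, 2, 8, 6]
print(bitonic_sort(lls)[:4])   # -> [9, 8, 7, 6]
```

For an input of 2L elements this network has log L + 1 stages and (log L + 1)(log L + 2)/2 comparator substages, matching the counts given below.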
For instance, let L = 4; the corresponding bitonic sorter circuit is shown in
Figure 3.9. There are three stages, s1, s2, s3, and these stages have substages
such as s1,1, s2,2, s3,2. The total number of stages is log L + 1 and the total
number of substages is (log L + 1)(log L + 2)/2. Each list processing core provides
two LLs to the BS. The input indices are represented with log 2L bit precision,
from one to eight, such that LL1,1 is the first, LL1,2 is the second and LL4,2 is
the eighth input index. These indices follow the LL data throughout the sorter to
keep the list history information. The elements in each substage are simple
comparators, which compare the left upper LL value with the left lower one. The
higher LL value is the output of the diamond connection and the lower LL value is
the output of the square connection, each with its initial index information. The
comparators do not use the initial index information; they use only the LL values
for comparison.
We have implemented three different bitonic sorters: the BS, the fast bitonic
sorter (FBS) and the fast reduced bitonic sorter (FRBS). For the BS
implementation, we use pipeline registers at the end of each substage. For the
FBS implementation, we use pipeline registers after every two substages of the
bitonic sorter. For the
Figure 3.9: Bitonic sorter circuit for L = 4.
FRBS implementation, the last log L optional substages, s3,2 and s3,3, are not
used. Thus, the output list information is not sorted from the best list to the
worst list. This may introduce some performance degradation, as will be discussed
in Section 3.5.
At the end of s3, there are four identical decision units (d) that make hard
decisions with respect to the indices of the sorted LLs. Since we activate the
sorter K − log L times, for k = 1, …, K − log L, and for all lists l = 1, …, L = 4,
the output hard decisions are saved as ûl,k. Let the input of a decision unit be t;
then the decision rule is

$$
\hat{u}_{l,k} =
\begin{cases}
0, & \text{if } t \bmod 2 = 1\\
1, & \text{otherwise.}
\end{cases}
\tag{3.15}
$$
In addition, there are four pointer units (p) that calculate the list number of
the output indices. Let the input of a pointer unit be l and the output be a, such
that a = ⌈l/2⌉. The output of a pointer unit determines the routing rule for an
LPU. Similar to this L = 4 example, an L = 2 sorter is shown as the dotted square
box in the upper left corner of the L = 4 bitonic sorter. A recursive
algorithm is used to generate a sorter for any given L.
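The decision and pointer rules can be sketched directly; the 1-based index conventions below follow the text:

```python
import math

def decision_unit(t: int) -> int:
    """Hard decision from a sorted LL's original index t in 1..2L:
    odd indices carry the likelihood-of-0 entries, so decide 0 (3.15)."""
    return 0 if t % 2 == 1 else 1

def pointer_unit(l: int) -> int:
    """Source list number of original index l in 1..2L: a = ceil(l/2)."""
    return math.ceil(l / 2)

print([decision_unit(t) for t in range(1, 9)])   # [0, 1, 0, 1, 0, 1, 0, 1]
print([pointer_unit(l) for l in range(1, 9)])    # [1, 1, 2, 2, 3, 3, 4, 4]
```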
The resource usage of the bitonic sorters is shown in Table 3.6. Since the FRBS
has advantages in terms of both resource usage and latency, we choose the FRBS
to implement the adaptive SCL decoder. The RTL schematic of the FBS for L = 2
is shown in Figure 3.10. For the FRBS implementation, the last stage bitonic
comparator in this RTL schematic is not used.
The CRC decoder is activated at the end of both the SC and SCL decoders to
detect whether a decision vector is valid. We use the CRC-16-CCITT polynomial,
x^16 + x^12 + x^5 + 1, for the implementation. The input of the CRC decoder is
x̂A at the end of the SC decoder and x̂l,A, for l = 1, …, L, at the end of the SCL
decoder. The output of the CRC decoder is boolean: it is equal to '1' if the CRC
is valid and '0' otherwise. The CRC decoder is shown in Figure 3.11. In this
figure, there are 16 pipeline registers and 3 XOR gates. At the initial stage, the output
Figure 3.10: RTL schematic of the fast bitonic sorter with L = 2.
of all registers is zero. After that, an input vector is loaded to the din pin from
the least significant bit (LSB) to the MSB, where the last 16 MSB bits are the
CRC bits. Loading an input vector into the CRC decoder takes K CCs. The final
CRC output is valid if and only if the outputs of all 16 flip-flops (FFs) are zero
after the Kth CC.
In the case of CRC encoding, the last 16 bits are set to zero. After the whole
input is loaded, the outputs of the 16 FFs form the CRC vector. Therefore, the
same circuit can be used for both CRC encoding and decoding.
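The division register can be modeled bit-serially as follows; this is a sketch with an all-zero initial state, matching the encode/decode duality described above (the function names are ours):

```python
POLY = 0x1021   # low 16 bits of x^16 + x^12 + x^5 + 1

def crc16_run(bits):
    """Shift the bit stream through the 16-bit polynomial division
    register (all-zero initial state); returns the final register."""
    reg = 0
    for b in bits:
        msb = (reg >> 15) & 1
        reg = ((reg << 1) | b) & 0xFFFF
        if msb:
            reg ^= POLY          # subtract the generator when it divides
    return reg

def crc16_encode(msg_bits):
    # Encoding: load the message with 16 trailing zeros; the register
    # then holds the CRC vector.
    return crc16_run(msg_bits + [0] * 16)

def crc16_valid(codeword_bits):
    # Decoding: the CRC is valid iff the register ends at zero.
    return crc16_run(codeword_bits) == 0

msg = [1, 0, 1, 1, 0, 0, 1, 0]
crc = crc16_encode(msg)
code = msg + [(crc >> i) & 1 for i in range(15, -1, -1)]
print(crc16_valid(code))   # True
```

The same `crc16_run` loop serves both roles, mirroring the shared encode/decode circuit: appending the computed remainder makes the stream divisible by the generator, so the register returns to zero.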
Figure 3.12: RTL schematic of the CRC for K = 512 with two CCs latency.
3.5 Adaptive SCL Decoder Implementation Results
We implement the adaptive decoder as the combination of the SC, SCL and CRC
decoders on a Xilinx Kintex-7 (xc7k325t-2ffg900c) FPGA. The resource usage,
latency and throughput of the adaptive decoder are shown in Table 3.8. For this
implementation, we set the input LLR precision of the SC decoder to PSCD = 6
bits. Although the input LL precision of the SCL decoder is Pi = 6 bits, we add
an extra log N bits to ensure that the decoder does not overflow. Thus, the
internal LL bit precision of the SCL decoder is PSCLD = 6 + log N bits. The
number of LPUs is set to V = 4 and we use the FRBS to sort the LL values in the
SCL decoder. The adaptive SCL decoder exhibits its maximum latency and
minimum throughput when the CRC is invalid at the end of the SC decoder and
the SCL decoder is activated. It achieves its minimum latency and maximum
throughput when the output of the SC decoder has a valid CRC. The increase in
BRAM usage is caused by the input data width of the asymmetric BRAM in the
SCL decoder.
Table 3.8: Implementation results of adaptive successive cancellation list decoder.
The resource percentages of the SC, SCL and CRC decoders in the adaptive SCL
decoder implementation are shown in Table 3.9. As the list size increases, the
SCL decoder uses more resources compared to the other decoders. In all cases,
the CRC decoder uses an insignificant amount of resources.
Table 3.9: The resource usage percentage of SC, SCL and CRC decoders.
end of the SC decoder. As Eb/No increases, the SCL decoder is activated less
frequently and the throughput of the adaptive SCL decoder improves. The
throughput improvement with respect to Eb/No is shown in Figures 3.13 and 3.14.
For N = 1024 and K = 512, the performance of the adaptive SCL decoder
implementation with the bitonic sorter is shown in Figures 3.17 and 3.18. At this
point, the FRBS causes more than 0.5 dB performance loss when L = 32. The
reason behind this performance loss might be hard decision candidates which have
more than one valid CRC at the end of the SCL decoder.
In addition, the BER performance of the adaptive SCL decoder with respect to
different internal bit precisions is shown in Figure 3.19 and the FER performance
is shown in Figure 3.20. Although the SCL decoder uses P = Pi + log N internal
[Figure 3.13: Throughput (Mbps) vs. Eb/No (dB) of FSCD, SCLD-L8 and Adaptive-SCLD-L8.]

[Figure 3.14: Throughput (Mbps) vs. Eb/No (dB) of FSCD, SCLD-L4 and Adaptive-SCLD-L4.]
[Plot: BER vs. Eb/No (dB); curves: L-32 BS, L-32 FRBS, L-16 BS, L-16 FRBS.]
Figure 3.15: BER performance of the adaptive SCL decoder with bitonic sorter,
N = 256, K = 128.
[Plot: FER vs. Eb/No (dB); curves: L-32 BS, L-32 FRBS, L-16 BS, L-16 FRBS.]
Figure 3.16: FER performance of the adaptive SCL decoder with bitonic sorter,
N = 256, K = 128.
[Plot: BER vs. Eb/No (dB); curves: L-32 BS, L-32 FRBS, L-16 BS, L-16 FRBS.]
Figure 3.17: BER performance of the adaptive SCL decoder with bitonic sorter,
N = 1024, K = 512.
[Plot: FER vs. Eb/No (dB); curves: L-32 BS, L-32 FRBS, L-16 BS, L-16 FRBS.]
Figure 3.18: FER performance of the adaptive SCL decoder with bitonic sorter,
N = 1024, K = 512.
LL precision, a precision of P = 11 bits is adequate for N = 1024, K = 512,
L = 16, Pi = 6, according to the BER and FER performance results.
[Plot: BER vs. Eb/No (dB); curves: P = 10, P = 11, P = 16.]
Figure 3.19: BER performance of the adaptive SCL decoder with different internal
bit precisions, N = 1024, K = 512, L = 16, Pi = 6
In this chapter, we surveyed SC and SCL decoder algorithms and implementations.
We analyzed the SC decoding architectures in terms of reducing complexity and
increasing throughput. Furthermore, we presented our adaptive successive
cancellation list decoder implementation, consisting of SC, SCL and CRC decoders.

The SC decoder has four main modules: the processing unit (PU), the decision
unit (DU), the partial sum update (PSU) and the controller logic (CL). For the
implementation of the SC decoder, we use the fast SC decoding method of [25].
This provides the detection of all-frozen (rate-0), all-free (rate-1), single parity
check (SPC) and repetition (REP) code segments to increase the throughput of the
SC decoder. These code segments can be decoded with a maximum likelihood (ML)
decoder.
[Plot: FER vs. Eb/No (dB); curves: P = 10, P = 11, P = 16.]
Figure 3.20: FER performance of the adaptive SCL decoder with different internal
bit precisions, N = 1024, K = 512, L = 16
The SCL decoder consists of three main modules: the list processing unit (LPU),
the list partial sum update (LPSU) and the sorter. We use the semi-parallel
architecture in [17] for the implementation of the LPU. In this way, we define
V LPUs, consisting in total of V soft decision router elements (SREs) and VL list
processing elements (LPEs). In the implementation of the adaptive SCL decoder,
we set V = 4. For the LPSU module, we use a tree structure to minimize the
latency; thus, N − 1 hard decision router elements (HREs) are used for the LPSU
module. For the sorter module, we have implemented three different sorters: the
bitonic sorter (BS), the fast bitonic sorter (FBS) and the fast reduced bitonic
sorter (FRBS). We selected the FRBS as our sorter to minimize the latency and
resource usage caused by the sorter in the SCL decoder. The FRBS introduces the
performance loss that we showed in this chapter.
of the adaptive SCL decoder up to 225 Mb/s data throughput.
Chapter 4
Conclusion
SPC, R-0 and R-1 in detail and presented its implementation results. In the SCL
decoder, we developed the FRBS to reduce the sorting latency significantly,
compared to a conventional BS. The FRBS implementation results were analyzed
in Section 3.3.3.1. After that, we showed the implementation logic of the CRC
decoder. Since it makes an insignificant contribution to the total implementation
complexity, we reduced its pipeline stages in Section 3.4. Lastly, we showed the
implementation results of the adaptive SCL decoder in terms of resource usage,
maximum clock frequency, memory usage, latency and throughput. The latency
and throughput of the adaptive SCL decoder vary with the operating SNR value.
As a result, we achieved approximately 100 Mb/s throughput at 1.5 dB Eb/No with
the parameters N = 256, K = 128, L = 8. For the N = 1024, K = 512, L = 4
decoder at 1.25 dB Eb/No, our implementation runs faster than 100 Mb/s data
throughput.
For future work, we will reduce the complexity of our implementation and make
it possible to implement adaptive L = 32 SCL decoders on standard commercial
FPGA chips.
Bibliography
[3] E. Arıkan and E. Telatar, “On the rate of channel polarization,” in Proc.
IEEE Int. Sym. on Inform. Theory (ISIT), pp. 1493–1495, Jul. 2009.
[4] I. Tal and A. Vardy, “List decoding of polar codes,” in Proc. IEEE Int. Sym.
Inf. Theory (ISIT), pp. 1–5, 2011.
[5] K. Niu and K. Chen, “Stack decoding of polar codes,” Elect. Lett., vol. 48,
pp. 695–596, Jun. 2012.
[8] P. Trifonov, “Efficient design and decoding of polar codes,” IEEE Trans. on
Comm., vol. 60, pp. 3221–3227, Nov. 2012.
[9] E. Arıkan, “Systematic polar coding,” IEEE Comm. Lett., vol. 15, pp. 860–
862, Aug. 2011.
[10] P. H. Winston, Artificial Intelligence. Addison-Wesley publishing company,
1993.
[13] B. Li, H. Shen, and D. Tse, “An adaptive successive cancellation list decoder
for polar codes with cyclic redundancy check,” IEEE Comm. Lett., vol. 16,
pp. 2044–2047, Dec. 2012.
[18] A. Pamuk and E. Arıkan, “A two phase successive cancellation decoder archi-
tecture for polar codes,” in Proc. IEEE Int. Sym. on Inform. Theory (ISIT),
pp. 957–961, Jul. 2013.
[19] C. Zhang, B. Yuan, and K. K. Parhi, “Reduced-latency SC polar decoder
architectures,” in Proc. IEEE Int. Conf. on Comm. (ICC), pp. 3471–3475,
Jun. 2012.
[29] K. Niu and K. Chen, “CRC-aided decoding of polar codes,” IEEE Comm.
Lett., vol. 16, pp. 1668–1671, Oct. 2012.
[30] A. Balatsoukas-Stimming and A. Burg, “Tree search architecture for list SC
decoding of polar codes.” Mar. 2013.
[32] C. Zhang, X. You, and J. Sha, “Hardware architecture for list successive can-
cellation polar decoder,” in Proc. IEEE Int. Sym. on Circuits and Systems
(ISCAS), pp. 209–212, Jun. 2014.
[34] J. Lin, C. Xiong, and Z. Yan, “A reduced latency list decoding algorithm for
polar codes.” Oct. 2014.
[35] J. Lin and Z. Yan, “Efficient list decoder architecture for polar codes,” in
Proc. IEEE Int. Symp. Circuits and Systems (ISCAS), pp. 1022–1025, Jun.
2014.
[37] Y. Fan, J. Chen, C. Xia, C. Tsui, J. Jin, H. Shen, and B. Li, “Low-latency
list decoding of polar codes with double thresholding,” CoRR, Apr. 2015.