SCC 2019 · February 11 – 14, 2019 in Rostock, Germany

A Low-Complexity Three-Error-Correcting BCH Decoder with Applications in Concatenated Codes

Jürgen Freudenberger,¹ Mohammed Rajab,¹ Sergo Shavgulidze²
¹Institute for System Dynamics, HTWG Konstanz, University of Applied Sciences, Germany
²Faculty of Power Engineering and Telecommunications, Georgian Technical University, Georgia
Email: {jfreuden, mrajab}@htwg-konstanz.de, [email protected]

Abstract—Error correction coding (ECC) for optical communication and persistent storage systems requires high-rate codes that enable high data throughput and low residual errors. Recently, different concatenated coding schemes were proposed that are based on binary Bose-Chaudhuri-Hocquenghem (BCH) codes with low error correcting capabilities. Commonly, hardware implementations for BCH decoding are based on the Berlekamp-Massey algorithm (BMA). However, for single, double, and triple error correcting BCH codes, Peterson's algorithm can be more efficient than the BMA. The known hardware architectures of Peterson's algorithm require Galois field inversion. This inversion dominates the hardware complexity and limits the decoding speed. This work proposes an inversion-less version of Peterson's algorithm. Moreover, a decoding architecture is presented that is faster than decoders that employ inversion or the fully parallel BMA at a comparable circuit size.

I. INTRODUCTION

Concatenated codes using BCH codes of moderate length and with low error correcting capability have recently been applied for error correction in optical communication as well as in storage systems. Such coding systems require high code rates, very high throughput with hard-input decoding, and low residual error rates. These requirements can be met by generalized concatenated codes, product codes, half-product codes, or staircase codes. For instance, generalized concatenated codes with inner BCH codes were investigated in [1], [2], [3], [4], [5]. Moreover, product code constructions based on BCH codes were proposed in [6], [7], [8], [9]. Hardware architectures for such codes were proposed, for instance, in [10], [11], [12], [13], [14]. Similarly, implementations for fast decoding of staircase codes require fast BCH decoding [15], [16].

Due to the required code rates, BCH codes that can only correct single, double, or triple errors are used. The decoding of the concatenated codes typically requires multiple rounds of BCH decoding. Hence, the achievable throughput depends strongly on the speed of the BCH decoder. Moreover, BCH codes that correct only two or three errors are used in random-access memory (RAM) applications [17], [18], [19], which require high data throughput and a very low decoding latency.

BCH decoding consists of three steps: syndrome calculation, calculation of the error location polynomial, and the Chien search, which determines the error positions. For BCH codes of moderate length (over Galois fields GF(2^6), ..., GF(2^12)), the syndrome calculation and the Chien search can be performed in parallel structures that calculate all syndrome values and all error positions within a single clock cycle, whereas the calculation of the error location polynomial is often performed using the Berlekamp-Massey algorithm (BMA), which requires several iterations. Alternatively, decoders based on Peterson's algorithm [20] were proposed in [21], [22], [10], [11]. Such decoders can be more efficient than the BMA for BCH codes with small error correcting capabilities, i.e. single, double, and triple error correcting codes.

In this work, we propose an inversion-less version of Peterson's algorithm for triple error correcting BCH codes. This algorithm is more efficient than the decoders employing Galois field inversion [21], [11]. Moreover, the proposed inversion-less Peterson's algorithm provides more flexibility regarding the hardware implementation and enables pipelining to speed up the decoding. A decoding architecture for such a pipelined decoder is presented.

The paper is organized as follows. In the next section, we introduce the notation and briefly discuss Peterson's algorithm, which is the basis of the proposed decoding procedure. The calculation of the error location polynomial for single, double, and triple errors along with the proposed inversion-less algorithm is presented in Section III. In Section IV, we propose a hardware architecture for this algorithm and compare its speed and area consumption with other algorithms.

II. PETERSON'S ALGORITHM

In this section, we briefly revise Peterson's algorithm and introduce the notation. The received vector is r(x) = v(x) + e(x), where v(x) = v0 + v1 x + ... + v_{n−1} x^{n−1} is a codeword of length n and e(x) = e0 + e1 x + ... + e_{n−1} x^{n−1} is the error vector. S1, S2, ..., S_{2t−1} denote the syndrome values, which are defined as

S_i = r(α^i) = e(α^i),   (1)

where α is the primitive element of the Galois field GF(2^m). For binary BCH codes, the following relation holds:

S_{2i} = S_i^2.   (2)

Let ν be the actual number of errors and t the error correcting capability of the BCH code. The coefficients of the error location polynomial σ(x) = σ0 + σ1 x + ... + σν x^ν satisfy a set of equations called Newton's identities. In matrix form these equations are

A_ν Δ_ν = S_ν.   (3)
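To make Equations (1)–(3) concrete, the following minimal Python sketch computes the syndromes of a received vector and checks the relation S_{2i} = S_i^2. It is an illustration only, not the paper's implementation: the small field GF(2^4) with primitive polynomial x^4 + x + 1 and all helper names are our own choices.

```python
M = 4                 # illustration in the small field GF(2^4)
PRIM = 0b10011        # primitive polynomial x^4 + x + 1 (assumed example field)
ALPHA = 2             # primitive element α, represented as the polynomial x

def gf_mul(a, b):
    """Carry-less multiplication modulo the primitive polynomial."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & (1 << M):
            a ^= PRIM
        b >>= 1
    return r

def gf_pow(a, e):
    r = 1
    for _ in range(e):
        r = gf_mul(r, a)
    return r

def syndromes(r_vec, t):
    """S_i = r(α^i) for i = 1, ..., 2t-1 (Equation (1)); S[0] is unused."""
    S = [0] * (2 * t)
    for i in range(1, 2 * t):
        a_i = gf_pow(ALPHA, i)
        acc, x_pow = 0, 1
        for r_j in r_vec:            # evaluate r(α^i) = Σ r_j (α^i)^j
            if r_j:
                acc ^= x_pow         # addition in GF(2^m) is XOR
            x_pow = gf_mul(x_pow, a_i)
        S[i] = acc
    return S

# received vector = all-zero codeword plus two bit errors (positions 3 and 7)
r_vec = [0] * 15
r_vec[3] = r_vec[7] = 1
S = syndromes(r_vec, t=3)
assert S[1] == gf_pow(ALPHA, 3) ^ gf_pow(ALPHA, 7)   # S1 = α^3 + α^7
assert S[2] == gf_mul(S[1], S[1])                    # S2 = S1^2 (Equation (2))
assert S[4] == gf_mul(S[2], S[2])                    # S4 = S2^2
```

Since the code is linear, decoding an error pattern on top of the all-zero codeword loses no generality: by Equation (1), S_i depends only on e(x).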

DOI: 10.30420/454862002

ISBN 978-3-8007-4862-4 · © 2019 VDE VERLAG GMBH · Berlin · Offenbach


[Fig. 1: the received vector r(x) feeds the syndrome calculation, the error location polynomial circuit, and the Chien search in sequence; a delay line carries r(x) alongside to produce the corrected codeword v(x).]

Fig. 1. Structure of a BCH decoder.
With σ0 = 1, the (i × i) matrix

A_i = ⎛ 1         0         0         ...  0       ⎞
      ⎜ S2        S1        1         ...  0       ⎟
      ⎜ S4        S3        S2        ...  0       ⎟
      ⎜ ...                                        ⎟
      ⎝ S_{2i−2}  S_{2i−3}  S_{2i−4}  ...  S_{i−1} ⎠ ,   (4)

the vector of coefficients

Δ_i = (σ1, σ2, ..., σi)^T,   (5)

and the syndrome vector

S_i = (−S1, −S3, ..., −S_{2i−1})^T.   (6)

Note that the matrix A_i is singular for i > ν. Hence, Peterson's algorithm first calculates the number of errors ν. Starting with i = t, the determinant D_i = det(A_i) is calculated. If D_i = 0, then the algorithm reduces the size of the matrix A_i (decreases i) until D_i = det(A_i) ≠ 0 holds and Equation (3) can be solved.

Finally, the Chien search determines the error positions by searching for the roots of the error location polynomial. The calculation of σ(α^i) for i = 0, ..., n−1 can be conducted in parallel using simple logic operations [13], [17].

III. CALCULATING THE ERROR LOCATION POLYNOMIAL FOR SINGLE, DOUBLE, AND TRIPLE ERRORS

For single, double, and triple errors, the following direct solutions of Newton's identities follow [23]:

σ(x) = 1 + S1 x   for ν = 1,   (7)

σ(x) = 1 + S1 x + ((S3 + S1^3)/S1) x^2   for ν = 2,   (8)

σ(x) = 1 + S1 x + ((S1^2 S3 + S5)/(S3 + S1^3)) x^2 + (S1^3 + S3 + S1 (S1^2 S3 + S5)/(S3 + S1^3)) x^3   for ν = 3.   (9)

These solutions are used in [21], [24], [11] for decoding BCH codes. The main difference between [21] and [11] is the implementation of the Galois field inversion in Equation (9). For instance, in [11] a parallel hardware implementation is proposed. This architecture requires only 4 Galois field multipliers, but additionally a Galois field inversion is required. The complexity and the throughput of this architecture are determined by the inversion. For the Galois field GF(2^10), the size of the inversion is about twice the size of a multiplier, and the length of the critical path is four times longer than that of a multiplier. In [21] the inversion is implemented using a look-up table, which is only efficient for small Galois fields, because the table size is of order O(m 2^m). Even for moderate Galois field sizes, e.g. m = 8, ..., 12, such look-up tables are costly if multiple instances of the decoder are required.

In the following, we propose an algorithm for triple errors that omits the Galois field inversion, similar to the approach in [24] that considers double errors. Omitting the inversion reduces the hardware complexity and speeds up the calculation. First, we consider the case of single and double errors. Note that the roots of the error location polynomial do not change if we multiply all coefficients by a non-zero factor. For instance, multiplying the right-hand side of Equation (8) by S1 ≠ 0, we obtain the equivalent solution

σ(x) = S1 + S1^2 x + D2 x^2   (10)

for ν = 2 with the determinant

D2 = S3 + S1^3.   (11)

Note that for ν = 1 and ν = 2, S1 is non-zero. For a single error in position i we have S1 = α^i ≠ 0. Similarly, for two errors in positions i and j, we have S1 = α^i + α^j ≠ 0, because α^i ≠ α^j. Equation (10) is also a solution for ν = 1, because D1 = S1 ≠ 0 and D2 = 0 hold for a single error.

Next, we consider the case ν ≥ 2. For ν = 2 and ν = 3, we have D2 ≠ 0 [21]. To see this, first consider ν = 2, where S1 = α^i + α^j and S3 = α^3i + α^3j. Hence,

S1^3 + S3 = (α^i + α^j)^3 + α^3i + α^3j = α^(i+2j) + α^(2i+j) ≠ 0   for i ≠ j.   (12)
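As a quick plausibility check of Equations (10)–(12), the sketch below builds the scaled locator σ(x) = S1 + S1^2 x + D2 x^2 for two errors and recovers the error positions by an exhaustive root search in the spirit of the Chien search. This is our own illustration over the toy field GF(2^4); the helper names and the field choice are assumptions, not the paper's code.

```python
M, PRIM, ALPHA, N = 4, 0b10011, 2, 15   # GF(2^4), x^4 + x + 1 (example field)

def gf_mul(a, b):
    """Carry-less multiplication modulo the primitive polynomial."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & (1 << M):
            a ^= PRIM
        b >>= 1
    return r

def gf_pow(a, e):
    r = 1
    for _ in range(e):
        r = gf_mul(r, a)
    return r

def poly_eval(sigma, x):
    """Evaluate σ at x over GF(2^m); addition is XOR."""
    acc, p = 0, 1
    for c in sigma:
        acc ^= gf_mul(c, p)
        p = gf_mul(p, x)
    return acc

i, j = 2, 9                                   # two error positions
S1 = gf_pow(ALPHA, i) ^ gf_pow(ALPHA, j)      # S1 = α^i + α^j
S3 = gf_pow(ALPHA, 3 * i) ^ gf_pow(ALPHA, 3 * j)

D2 = S3 ^ gf_mul(S1, gf_mul(S1, S1))          # D2 = S3 + S1^3 (Equation (11))
assert D2 != 0                                # guaranteed by Equation (12) for i ≠ j
sigma = [S1, gf_mul(S1, S1), D2]              # σ(x) = S1 + S1^2 x + D2 x^2 (Eq. (10))

# position p is in error iff σ(α^{-p}) = 0
found = [p for p in range(N) if poly_eval(sigma, gf_pow(ALPHA, (N - p) % N)) == 0]
assert found == [2, 9]
```

Scaling by the non-zero factor S1 leaves the root set unchanged, which is all the Chien search evaluates.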

Similarly, for ν = 3 we have S1 = α^i + α^j + α^k and S3 = α^3i + α^3j + α^3k. Consequently,

S1^3 + S3 = (α^i + α^j + α^k)^3 + α^3i + α^3j + α^3k
          = α^(i+2j) + α^(i+2k) + α^(j+2k) + α^(j+2i) + α^(k+2i) + α^(k+2j).   (13)

The last term is the determinant of the matrix

⎛ 1  α^i  α^2i ⎞
⎜ 1  α^j  α^2j ⎟ .   (14)
⎝ 1  α^k  α^2k ⎠

This matrix has full rank, because its columns are linearly independent. Hence, D2 ≠ 0 holds for ν = 3.

Now, multiplying the right-hand side of Equation (9) by D2, we obtain an equivalent solution for ν = 3 as

σ(x) = D2 + S1 D2 x + δ2 x^2 + D3 x^3   (15)

with

δ2 = S1^2 S3 + S5   (16)

and the determinant

D3 = S1 (S2 S3 + S1 S4) + S3^2 + S1 S5.   (17)

Using (2), (11), and (16) we obtain

D3 = S1 (S1^2 S3 + S5) + S1^6 + S3^2 = S1 δ2 + D2^2.   (18)

The decoding procedure is summarized in Algorithm 1. This algorithm can easily be adapted to the decoding of single and double error correcting codes, e.g. by setting D3 = 0 for double error correcting BCH codes. This is important for the decoding of GC codes that use nested inner codes [13], [14], where the error correcting capability increases from level to level. In [14], Chase decoding of the inner BCH codes is used for soft-input decoding. This requires multiple BCH decoding operations for each received BCH codeword. Due to the complexity of the soft-input decoding and the small performance gain for the better protected levels, the Chase decoding is limited to the first three levels in [14]. These are single, double, and triple error correcting BCH codes.

Algorithm 1 Inversion-less Peterson algorithm
  calculate D2, δ2, D3
  if D3 == 0 then
    return σ(x) = S1 + S1^2 x + D2 x^2
  else
    return σ(x) = D2 + S1 D2 x + δ2 x^2 + D3 x^3

IV. HARDWARE ARCHITECTURE

In this section, we present a hardware architecture for the proposed decoding algorithm and compare its speed (critical path length) and area consumption with other algorithms. Note that the critical path length and the circuit size are dominated by the Galois field multipliers and the Galois field inversion. The size of a bit-parallel multiplier grows with order O(m^2) and the critical path with O(m). The Galois field inversion is often implemented using Fermat's little theorem, which requires only a single multiplier and a squaring operation, but m−1 clock cycles [25]. Hence, the total number of basic logic operations per inversion is of order O(m^3). On the other hand, the addition and squaring operations are of order O(m) with a critical path length of O(1). Consequently, these two operations are neglected in the following discussion.

[Fig. 2: the syndromes S1, S3, S5 enter a network of squarers and Galois field multipliers with pipeline registers; the intermediate values S1 D2, D2, and δ2 are registered, and the outputs are the locator coefficients σ0, σ1, σ2, σ3.]

Fig. 2. Hardware architecture of the decoder pipeline.

Algorithm 1 can be implemented performing all operations in parallel. Such an implementation requires four multipliers and has a critical path length of two multipliers. It is more efficient than the implementation proposed in [11]. The architecture in [11] uses three multipliers and one inversion, where the logic for the inversion is about twice the size of a multiplier (for GF(2^10)) and the critical path length of the inversion is equivalent to four multiplications. The total critical path in [11] has a length that is equivalent to six multiplications. Hence, at a smaller size, the proposed algorithm has a significantly shorter critical path.

Moreover, the proposed algorithm enables pipelined architectures that can speed up the decoding, whereas the Galois field inversion is an atomic operation that limits the efficiency of pipelining. Figure 2 presents such a pipeline (without control logic). The pipeline requires four multipliers and additional registers (three registers of width m bits) to store intermediate results. The pipeline reduces the critical

path length to a single multiplication. Hence, the pipelined architecture doubles the throughput compared with the structure without the pipeline. Note that the fully parallel BMA also has a critical path length of a single multiplication. However, the parallel BMA requires 2t multipliers, at least 2t registers, and t iterations [26], [27]. Hence, the proposed architecture is smaller and about three times as fast as the parallel BMA.

To verify the above size considerations, the proposed decoding algorithm has been implemented on a field-programmable gate array (FPGA) in Verilog. Table I contains results for the Xilinx Virtex-7 FPGA. The fundamental building blocks of an FPGA are flip-flops and look-up tables (LUT). The size of the logic is represented by the number of LUT. Table I presents data for m = 8 and m = 12. As can be seen, both decoders require 3m flip-flops for the registers. The size of the decoder for m = 12 is dominated by the four multipliers, which require about 90% of the logic. The speed of the decoders is determined by the achievable clock frequency f_clk. The circuit for m = 8 achieves a clock frequency f_clk = 500 MHz, i.e. a throughput of 500·10^6 BCH codewords per second, because one codeword is processed per clock cycle. The latency is two clock cycles, i.e. 4 ns. Moreover, the decoder for m = 8 is about 1.5 times faster than the decoder for m = 12, which confirms that the critical path length is of order O(m). Similarly, the ratio of 2.2 between the logic sizes for m = 12 and m = 8 agrees well with the estimate O(m^2) for the circuit size. The results for the parallel BMA are estimates based on the required number of multipliers and registers; the actual size will be higher. The BMA requires three clock cycles per BCH codeword. Hence, the achievable throughput is 111·10^6 BCH codewords per second at a clock frequency of f_clk = 333 MHz.

V. CONCLUSIONS

In this paper, we have proposed an algorithm to compute the error location polynomial for single, double, and triple error correcting binary BCH codes. The proposed method is an inversion-less version of Peterson's algorithm. For triple errors, the proposed algorithm is more efficient than the BMA. The presented pipelined decoding architecture is faster than decoders that employ inversion or the fully parallel BMA at a comparable circuit size. The new decoder can be applied for decoding the BCH component codes in concatenated codes [6], [8]. In particular, with GC codes that employ nested inner BCH codes [13], [3], [4], it is important that the decoder supports different error correcting capabilities, because the error correcting capability increases from level to level. Furthermore, the proposed decoder may help to speed up soft-input decoding algorithms for GC codes that are based on Chase decoding [14], [28]. The Chase decoding procedure requires multiple BCH decoding operations for each received BCH codeword, where the calculation of the error location polynomial limits the achievable throughput.

ACKNOWLEDGMENT

We thank Hyperstone GmbH, Konstanz for supporting this project. The German Federal Ministry of Research and Education (BMBF) and the Shota Rustaveli National Science Foundation (SRNSF) supported the research for this article (BMBF 03FH025IX5 and SRNSF FR17 74).

REFERENCES

[1] A. Fahrner, H. Griesser, R. Klarer, and V. Zyablov, "Low-complexity GEL codes for digital magnetic storage systems," IEEE Transactions on Magnetics, vol. 40, no. 4, pp. 3093–3095, July 2004.
[2] J. Spinner, J. Freudenberger, and S. Shavgulidze, "A soft input decoding algorithm for generalized concatenated codes," IEEE Transactions on Communications, vol. 64, no. 9, pp. 3585–3595, Sept 2016.
[3] I. Zhilin, A. Kreschuk, and V. Zyablov, "Generalized concatenated codes with soft decoding of inner and outer codes," in International Symposium on Information Theory and Its Applications (ISITA), Oct 2016, pp. 290–294.
[4] I. V. Zhilin and V. V. Zyablov, "Generalized error-locating codes with component codes over the same alphabet," Problems Inform. Transmission, vol. 53, no. 2, pp. 114–135, Sept 2017.
[5] M. Rajab, "Soft-input bit-flipping decoding of generalized concatenated codes for application in non-volatile flash memories," IET Communications, Nov 2018.
[6] S. Cho, D. Kim, J. Choi, and J. Ha, "Block-wise concatenated BCH codes for NAND flash memories," IEEE Transactions on Communications, vol. 62, no. 4, pp. 1164–1177, April 2014.
[7] D. Kim and J. Ha, "Quasi-primitive block-wise concatenated BCH codes for NAND flash memories," in IEEE Information Theory Workshop (ITW), Nov 2014, pp. 611–615.
[8] D. Kim and J. Ha, "Quasi-primitive block-wise concatenated BCH codes with collaborative decoding for NAND flash memories," IEEE Transactions on Communications, vol. 63, no. 10, pp. 3482–3496, Oct 2015.
[9] D. Kim and J. Ha, "Serial quasi-primitive BC-BCH codes for NAND flash memories," in 2016 IEEE International Conference on Communications (ICC), May 2016, pp. 1–6.
[10] K. Lee, H. Kang, J. Park, and H. Lee, "100Gb/s two-iteration concatenated BCH decoder architecture for optical communications," in 2010 IEEE Workshop on Signal Processing Systems, Oct 2010, pp. 404–409.
[11] X. Zhang and Z. Wang, "A low-complexity three-error-correcting BCH decoder for optical transport network," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 59, no. 10, pp. 663–667, Oct 2012.
[12] C. Yang, Y. Emre, and C. Chakrabarti, "Product code schemes for error correction in MLC NAND flash memories," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 20, no. 12, pp. 2302–2314, Dec 2012.
[13] J. Spinner and J. Freudenberger, "Decoder architecture for generalized concatenated codes," IET Circuits, Devices & Systems, vol. 9, no. 5, pp. 328–335, 2015.
[14] J. Spinner, D. Rohweder, and J. Freudenberger, "Soft input decoder for high-rate generalised concatenated codes," IET Circuits, Devices & Systems, vol. 12, no. 4, pp. 432–438, 2018.
[15] B. P. Smith, A. Farhood, A. Hunt, F. R. Kschischang, and J. Lodge, "Staircase codes: FEC for 100 Gb/s OTN," Journal of Lightwave Technology, vol. 30, no. 1, pp. 110–117, Jan 2012.
[16] G. Hu, J. Sha, and Z. Wang, "Beyond 100Gbps encoder design for staircase codes," in 2016 IEEE International Workshop on Signal Processing Systems (SiPS), Oct 2016, pp. 154–158.
[17] D. Strukov, "The area and latency tradeoffs of binary bit-parallel BCH decoders for prospective nanoelectronic memories," in 2006 Fortieth Asilomar Conference on Signals, Systems and Computers, Oct 2006, pp. 1183–1187.
[18] P. Amato, C. Laurent, M. Sforzin, S. Bellini, M. Ferrari, and A. Tomasoni, "Ultra fast, two-bit ECC for emerging memories," in 2014 IEEE 6th International Memory Workshop (IMW), May 2014, pp. 1–4.
[19] C. Yang, M. Mao, Y. Cao, and C. Chakrabarti, "Cost-effective design solutions for enhancing PRAM reliability and performance," IEEE Transactions on Multi-Scale Computing Systems, vol. 3, no. 1, pp. 1–11, Jan 2017.

TABLE I
RESULTS FOR THE FPGA IMPLEMENTATION OF THE PROPOSED ALGORITHM.

Module                     | number of LUT | number of flip-flops | throughput (codewords per second)
proposed decoder GF(2^8)   | 145           | 24                   | 500·10^6
multiplier GF(2^12)        | 71            | –                    | 333·10^6
proposed decoder GF(2^12)  | 318           | 36                   | 333·10^6
parallel BMA GF(2^12)      | 426           | 84                   | 111·10^6
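A software reference model of Algorithm 1 is useful for checking a hardware implementation such as the one summarized in Table I against known error patterns. The sketch below is our own Python model over the toy field GF(2^4) (the field choice and all function names are assumptions, not the paper's code): it computes D2, δ2, and D3 from the odd syndromes, selects the quadratic or cubic locator exactly as Algorithm 1 does, and verifies single-, double-, and triple-error patterns via root search.

```python
M, PRIM, ALPHA, N = 4, 0b10011, 2, 15   # GF(2^4), x^4 + x + 1 (example field)

def gf_mul(a, b):
    """Carry-less multiplication modulo the primitive polynomial."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & (1 << M):
            a ^= PRIM
        b >>= 1
    return r

def gf_pow(a, e):
    r = 1
    for _ in range(e):
        r = gf_mul(r, a)
    return r

def poly_eval(sigma, x):
    acc, p = 0, 1
    for c in sigma:
        acc ^= gf_mul(c, p)
        p = gf_mul(p, x)
    return acc

def algorithm1(S1, S3, S5):
    """Inversion-less Peterson (Algorithm 1): σ as a coefficient list, σ0 first."""
    S1_sq = gf_mul(S1, S1)
    D2 = S3 ^ gf_mul(S1, S1_sq)              # D2 = S3 + S1^3    (Equation (11))
    d2 = gf_mul(S1_sq, S3) ^ S5              # δ2 = S1^2 S3 + S5 (Equation (16))
    D3 = gf_mul(S1, d2) ^ gf_mul(D2, D2)     # D3 = S1 δ2 + D2^2 (Equation (18))
    if D3 == 0:
        return [S1, S1_sq, D2]               # σ(x) = S1 + S1^2 x + D2 x^2
    return [D2, gf_mul(S1, D2), d2, D3]      # σ(x) = D2 + S1 D2 x + δ2 x^2 + D3 x^3

def odd_syndromes(error_positions):
    """S1, S3, S5 of an error pattern: S_k = Σ α^{k p} (Equation (1))."""
    S = []
    for k in (1, 3, 5):
        acc = 0
        for p in error_positions:
            acc ^= gf_pow(ALPHA, k * p)
        S.append(acc)
    return S

def root_positions(sigma):
    """Chien-style search: position p is in error iff σ(α^{-p}) = 0."""
    return [p for p in range(N) if poly_eval(sigma, gf_pow(ALPHA, (N - p) % N)) == 0]

for errs in ([4], [4, 11], [1, 6, 12]):      # ν = 1, 2, 3
    assert root_positions(algorithm1(*odd_syndromes(errs))) == errs
```

Note that for ν ≤ 2 the model lands in the D3 == 0 branch automatically, mirroring the paper's remark that the algorithm adapts to lower error correcting capabilities.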

[20] W. Peterson, "Encoding and error-correction procedures for the Bose-Chaudhuri codes," IRE Transactions on Information Theory, vol. 6, no. 4, pp. 459–470, September 1960.
[21] E.-H. Lu, S.-W. Wu, and Y.-C. Cheng, "A decoding algorithm for triple-error-correcting binary BCH codes," Information Processing Letters, vol. 80, no. 6, pp. 299–303, 2001.
[22] S. Lin and D. J. Costello, Error Control Coding. Upper Saddle River, NJ: Prentice-Hall, 2004.
[23] Y. Jiang, A Practical Guide to Error-Control Coding Using Matlab. Artech House, 2010.
[24] N. Ahmadi, M. H. Sirojuddiin, A. D. Nandaviri, and T. Adiono, "An optimal architecture of BCH decoder," in 2010 4th International Conference on Application of Information and Communication Technologies, Oct 2010, pp. 1–5.
[25] T. Itoh and S. Tsujii, "A fast algorithm for computing multiplicative inverses in GF(2^m) using normal bases," Information and Computation, vol. 78, no. 3, pp. 171–177, 1988.
[26] W. Liu, J. Rho, and W. Sung, "Low-power high-throughput BCH error correction VLSI design for multi-level cell NAND flash memories," in IEEE Workshop on Signal Processing Systems Design and Implementation (SIPS), Oct 2006, pp. 303–308.
[27] J. Freudenberger and J. Spinner, "A configurable Bose-Chaudhuri-Hocquenghem codec architecture for flash controller applications," Journal of Circuits, Systems, and Computers, vol. 23, no. 2, pp. 1–15, Feb 2014.
[28] J. Freudenberger, M. Rajab, and S. Shavgulidze, "A soft-input bit-flipping decoder for generalized concatenated codes," in 2018 IEEE International Symposium on Information Theory (ISIT), June 2018, pp. 1301–1305.

