Low-Complexity Transformed Encoder Architectures For Quasi-Cyclic Nonbinary LDPC Codes Over Subfields
Abstract— Quasi-cyclic low-density parity-check (QC-LDPC) codes are adopted in many digital communication and storage systems. The encoding of these codes is traditionally done by multiplying the message vector with a generator matrix consisting of dense circulant submatrices. To reduce the encoder complexity, this paper introduces two schemes making use of finite Fourier transform. We focus on QC-LDPC codes whose circulant submatrices are of dimension (2^r − 1) × (2^r − 1) and the entries are elements of GF(2^p), where p divides r, and hence, GF(2^p) is a subfield of GF(2^r). These cover a broad range of codes, and binary LDPC codes are a special case. Making use of conjugacy constraints, low-complexity architectures are developed for finite Fourier and inverse transforms over subfields in this paper. In addition, composite field arithmetic is exploited to eliminate the computations associated with message mapping and reduce the complexity of Fourier transform. For a (2016, 1764) nonbinary QC-LDPC code whose generator matrix consists of circulants of dimension 63 × 63 with GF(2^2) entries, the proposed encoders achieve 22% area reduction compared with the conventional encoders without sacrificing the throughput.

Index Terms— Encoder, finite field, Fourier transform, low-density parity-check (LDPC) codes, nonbinary (NB), quasi-cyclic (QC), VLSI architecture.

I. INTRODUCTION

BINARY low-density parity-check (LDPC) codes are broadly adopted in digital communication and storage systems. Compared with binary codes, nonbinary LDPC (NB-LDPC) codes, whose entries in the parity check and generator matrices are GF(2^p) (p > 1) elements, can achieve higher coding gain when the code length is moderate and have better burst error-correction performance. NB-LDPC codes are also finding their way into practical systems due to recent work on decoder complexity reduction, such as those in [1]–[3].

Quasi-cyclic LDPC (QC-LDPC) codes enable efficient partial-parallel processing because of the regularity in their parity-check matrices H. The H matrix of a binary QC-LDPC code consists of cyclic permutation matrices (CPMs), and that of a QCNB-LDPC code can be designed by replacing each “1” in the CPMs with elements of GF(2^p) (p > 1) [4]. The dimension of the CPMs is usually moderate or large in order to reach a reasonably long codeword length. On the other hand, p needs to be small to keep the decoder complexity low. This paper focuses on the cases that the dimension of the CPMs is (2^r − 1) × (2^r − 1) and p divides r. Accordingly, GF(2^p) is a subfield of GF(2^r). These cases are quite versatile and binary codes are a special case with p = 1.

LDPC code design usually starts with the H matrix. If H has a lower triangular structure, then the encoding can be done by using H directly. However, such an H matrix often has at least a few columns with low weight. This makes the corresponding bits more likely to stay at wrong values, especially in bit-flipping type of decoders, and hence leads to an earlier error floor. A more general encoding method that applies to codes with any H is to derive a systematic generator matrix G from H first, and then multiply the message vector with G [5]. The computed G consists of blocks of circulant matrices, and the circulants corresponding to the parity symbols are usually dense. This means that a large number of multipliers are required for encoding.

A transformed approach was proposed in [6] to reduce the encoder complexity of binary QC-LDPC codes. Applying finite Fourier transform, each dense circulant becomes a diagonal matrix, and hence, the number of multipliers needed for the transformed generator matrix multiplication is greatly reduced at the cost of Fourier transform and some message mapping. Later, two improved encoding methods making use of finite Fourier transform were described in [7]. They are the binary cases of the encoders over subfields developed independently in [8] and [9]. No hardware implementation architecture was provided in [7]. If implemented directly according to the formulas, the overheads brought by the Fourier transform, message mapping, and their inverse may offset the savings achieved by the simplified generator matrix multiplication.

This paper first presents the two modified transformed encoders for QCNB-LDPC codes over subfields described in [8] and [9] with more details. Unlike the approach in [6], the modified encoders do not require row or column permutation on G. The first encoder maps the messages in an intrablock instead of interblock way, and the second encoder does not need message mapping at all. As a result, the data flow is simplified and large buffers are eliminated. By making use of the property that the inputs (outputs) are subfield elements and outputs (inputs) satisfy the conjugacy constraints, efficient partial-parallel architectures are developed for the Fourier (inverse) transform over subfields. The simplified architectures are made possible by exploiting composite field arithmetic, which also contributes to the elimination of the computational complexity associated with the message map-

Manuscript received July 2, 2016; revised September 29, 2016; accepted November 12, 2016.
The authors are with Western Digital Corporation, San Jose, CA 95138 USA.
Digital Object Identifier 10.1109/TVLSI.2016.2630055
1063-8210 © 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
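The introduction's statement that the finite Fourier transform turns each dense circulant into a diagonal matrix can be checked numerically on a small case. The Python sketch below is illustrative only and is not taken from the paper: it assumes toy parameters (r = 3, p = 1, so e = 7), builds GF(2^3) from x^3 + x + 1, and assumes the sign convention V = [α^(−ij)] with V^−1 = [α^(ij)]. The helper names (`gf_mul`, `mat_mul`, `circulant`, `transform`) and the first row of the circulant are hypothetical.

```python
# Toy check (assumed parameters, not the paper's example code):
# r = 3, p = 1, so circulants are 7 x 7 with binary entries and the
# transform is over GF(2^3), built here from x^3 + x + 1.
R, E, POLY = 3, 7, 0b1011

# EXP[i] = alpha^i as a 3-bit integer; LOG is the inverse table.
EXP = [1] * E
for i in range(1, E):
    v = EXP[i - 1] << 1
    EXP[i] = v ^ POLY if v >> R else v
LOG = {v: i for i, v in enumerate(EXP)}


def gf_mul(a, b):
    """Multiply two GF(2^3) elements using log/antilog tables."""
    if a == 0 or b == 0:
        return 0
    return EXP[(LOG[a] + LOG[b]) % E]


def mat_mul(A, B):
    """Matrix product over GF(2^3); addition is bitwise XOR."""
    out = [[0] * len(B[0]) for _ in A]
    for i, row in enumerate(A):
        for j in range(len(B[0])):
            acc = 0
            for k, a in enumerate(row):
                acc ^= gf_mul(a, B[k][j])
            out[i][j] = acc
    return out


# Assumed sign convention: V = [alpha^(-ij)], V^-1 = [alpha^(ij)].
# No 1/e scaling is needed because e = 7 is odd, i.e. e = 1 in GF(2).
V = [[EXP[(-i * j) % E] for j in range(E)] for i in range(E)]
V_INV = [[EXP[(i * j) % E] for j in range(E)] for i in range(E)]


def circulant(first_row):
    """Circulant whose row i is the first row cyclically shifted right by i."""
    return [[first_row[(j - i) % E] for j in range(E)] for i in range(E)]


def transform(C):
    """Return V^-1 C V, the transformed circulant."""
    return mat_mul(mat_mul(V_INV, C), V)


dense = circulant([1, 1, 0, 1, 0, 0, 1])  # hypothetical dense binary circulant
D = transform(dense)
diag = [D[i][i] for i in range(E)]
is_diagonal = all(D[i][j] == 0 for i in range(E) for j in range(E) if i != j)
# Conjugacy constraint for p = 1: diag[(2 i) mod e] == diag[i]^2.
conjugate_ok = all(diag[(2 * i) % E] == gf_mul(diag[i], diag[i]) for i in range(E))
print(is_diagonal, conjugate_ok)
```

Under these assumptions, only the e diagonal entries of each transformed circulant can be nonzero, and they are tied together by the conjugacy constraints, which is why far fewer multipliers are needed than for the dense circulant.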
ZHANG AND TAI: LOW-COMPLEXITY TRANSFORMED ENCODER ARCHITECTURES FOR QCNB-LDPC CODES OVER SUBFIELDS 3
Here, (2i)_e denotes 2i mod e. To make the encoder output binary for binary messages, it was proposed in [6] to permute G_F and carry out mappings on m. The rows of G_F are permuted according to π_row = [0, e, 2e, ..., (k − 1)e, 1, e + 1, 2e + 1, ..., (k − 1)e + 1, ..., e − 1, 2e − 1, 3e − 1, ..., ke − 1]. It means that row e of G_F becomes the second row after permutation, row 2e of G_F becomes the third row after permutation, and so on. The columns are also permuted in a similar way. The permuted and transformed generator matrix, G_{F,π}, only has e nonzero matrices of dimension k × n in the diagonal. In addition, an e-tuple vector formed by the entries in the same row and column inside each of these e submatrices satisfies the conjugacy constraints. This is referred to as satisfying the conjugacy constraints in an interblock way. m is also divided into e blocks of k messages and mapped to m̂ to satisfy the conjugacy constraints in an interblock way.

The indices of the elements satisfying the conjugacy constraints form cyclotomic cosets {i, (2i)_e, (2^2 i)_e, ..., (2^{η_i−1} i)_e}. The cardinality of each coset, η_i, must be a divisor of r. Let {β_{i,0}, β_{i,1}, ..., β_{i,η_i−1}} be a basis of the subfield GF(2^{η_i}). The message mapping is done for each cyclotomic coset and μ = 0, 1, ..., η_i − 1, j = 0, 1, ..., k − 1 as

\[ \hat{m}_{(2^{\mu} i)_e k + j} = \left( \sum_{l=0}^{\eta_i - 1} \beta_{i,l}\, m_{(2^{l} i)_e k + j} \right)^{2^{\mu}}. \tag{2} \]

As a result, m̂G_{F,π} satisfies the conjugacy constraints in an interblock way. If reversely permuted, it would become a vector that satisfies the constraints in an intrablock manner, i.e., each block of e consecutive symbols satisfies the constraints. Accordingly, binary codewords are derived as follows [6]:

\[ c = \hat{c}_{F,\pi^{-1}} V^{-1(n)} = (\hat{m} G_{F,\pi})_{\pi^{-1}} V^{-1(n)}. \tag{3} \]

The c value computed from (3) is different from mG and that in (1). However, all of them are valid codewords.

Although computing m̂G_{F,π} requires far fewer multipliers than calculating mG, the transformed encoding according to (3) has drawbacks. The indices in a cyclotomic coset can be quite far apart. This requires large buffers to carry out the interblock mapping in (2). The permutations on G_F and ĉ_F complicate their storage and access. Besides the message mapping in (2) and inverse Fourier transform in (3), inverse mapping and Fourier transform are needed in the decoder. The overhead of these computations may offset the saving brought by the less-complicated multiplication of m̂ and G_{F,π}.

The work in [6] is limited to binary codes and provided only theoretical complexity analysis. In Section III, the transformed encoding is first extended to the cases that the message symbols are elements of a subfield. Then, two modified encoding schemes, A and B, are described. The encoder in the transformed domain and that by Galois Fourier transform in [7] are the binary cases of encoders A and B, respectively. Joint algorithmic and architectural optimizations are proposed in the section after to develop efficient hardware implementation architectures for our modified transformed encoders.

III. MODIFIED FINITE FOURIER-TRANSFORMED ENCODERS OVER SUBFIELDS

If the message symbols are elements of GF(2^p), the codeword should also be over GF(2^p) to reduce the storage requirement and decoding complexity. The conjugacy constraints can be extended to make the transformed encoding outputs in the same finite field. An e-tuple vector w is over GF(2^p), a subfield of GF(2^r), if and only if the symbols in its Fourier transform d = F(w) satisfy the following conjugacy constraints [11]:

\[ d_{(2^{p} i)_e} = d_i^{2^{p}}. \tag{4} \]

The corresponding cyclotomic cosets are {i, (2^p i)_e, (2^{2p} i)_e, ..., (2^{(η_i−1)p} i)_e}, and the cardinality, η_i, of each coset is a divisor of r/p.

In the transformed encoding of [6], G_F is permuted to satisfy the conjugacy constraints in an interblock manner. Then, after the mapped messages are multiplied, the product is reversely permuted to satisfy the conjugacy constraints in an intrablock way. However, the entries in each of the diagonal submatrices of G_F already satisfy the constraints (4) in an intrablock way when the entries of G are over GF(2^p). If the message vector multiplied to G_F satisfies the constraints in the same way, then no permutation is required on G_F and the product does not need to be reversely permuted. Inspired by this, two modified transformed encoders are proposed next.

A. Modified Transformed Encoder A

Divide m into k blocks of e messages. Each block is mapped to a block of e messages in m̄ that satisfy the conjugacy constraints in (4) according to

\[ \bar{m}_{ej + (2^{p\mu} i)_e} = \left( \sum_{l=0}^{\eta_i - 1} \beta_{i,l}\, m_{ej + (2^{pl} i)_e} \right)^{2^{p\mu}}, \tag{5} \]

where μ = 0, 1, ..., η_i − 1, j = 0, 1, ..., k − 1, and {β_{i,0}, β_{i,1}, ..., β_{i,η_i−1}} is a basis of GF((2^p)^{η_i}). Using m̄, the encoding is modified as

\[ c = \bar{c}_F V^{-1(n)} = (\bar{m} G_F) V^{-1(n)}. \tag{6} \]

For a given m, the c value computed from (6) is not the same as those from (3) or the original encoding. However

\[ cH^T = (\bar{m} G_F) V^{-1(n)} H^T = \bar{m} V^{-1(k)} G V^{(n)} V^{-1(n)} H^T = \bar{m} V^{-1(k)} G H^T = 0 \]

and hence, it is also a valid codeword. Since each block of e symbols in m̄G_F satisfies the conjugacy constraints, the encoding using (6) also generates a codeword over GF(2^p).

In (5), the mapping is done based on the messages in the same block. Assume that the input messages arrive in a partial-parallel way. The extra message buffer needed in the proposed encoding is only for one single block of e messages. On the other hand, the interblock message mapping in (2) may involve messages from blocks quite far apart. For example, when r = 6 and p = 2, one cyclotomic coset is {3, 12, 48},
and hence, message blocks 3–48 need to be buffered for the corresponding message mapping. Moreover, the proposed modified encoding does not need any row or column permutation. This simplifies the access of the memory storing the generator matrix, and no buffer or routing network is needed on m̄G_F. As a result, the intrablock message mapping and modified encoding in (6) lead to not only substantial buffer reduction but also much simpler dataflow.

Although [6] and [7] mentioned that a basis is needed for message mapping, they did not specify what type of bases should be used to reduce the implementation complexity. The mapping is done as basis element multiplications followed by additions. However, if the basis is not chosen properly, it may be very difficult to recover m from the mapped messages in the decoder. For cyclotomic cosets of the same cardinality, the same basis is used. Moreover, m consists of elements of GF(2^p), a subfield of GF(2^r) from which the symbols of m̄ are drawn. Intuitively, the composite field GF((2^p)^t) (t = r/p), which is isomorphic to GF(2^r), can be adopted to reduce the mapping complexity. Let f(x) be a degree-t polynomial that is irreducible over GF(2^p). An element of GF((2^p)^t) can be expressed as a_0 + a_1 x + ··· + a_{t−1} x^{t−1}, where a_s ∈ GF(2^p) (0 ≤ s < t) and x is a root of f(x). η_i is a divisor of t. When η_i = t, a good basis to use for the message mapping is {β_{i,0}, β_{i,1}, ..., β_{i,t−1}} = {1, x, ..., x^{t−1}}. In this case, proper symbols of m are collected to be the a_s coefficients of each m̄_{ej+i} in composite field representation. This does not require any logic computation. Then, m̄_{ej+(2^{pμ} i)_e} for μ > 0 are computed by iteratively taking the 2^p-th power of m̄_{ej+i}, which is implemented by simple logic for finite fields of characteristic two. Similarly, the inverse mapping is done by assigning the a_s coefficients in each m̄_{ej+i} to proper symbols in m. Hence, the inverse mapping does not require any logic gate either. If η_i = 1, no mapping is needed. When 1 < η_i < t, each β value in the basis should be an element of the subfield GF((2^p)^{η_i}). Among all the elements of GF((2^p)^{η_i}), pick η_i of them that have nonzero a_s at distinct locations. They form a basis of the subfield, and the mapping using this basis can be done in a similar way as in the case of η_i = t.

Besides making the message mapping a very easy process, adopting composite fields has other benefits. Composite field multipliers have lower complexity than the multipliers in the original field. They also enable simplifications of the Fourier transform over subfields as will be shown in Section IV.

A variation of the encoder described in this section for binary codes was also presented in [7]. m is directly multiplied with G_F. Then the products are mapped using basis {β_{i,0}, β_{i,1}, ..., β_{i,η_i−1}} according to an equation similar to (5) in order to satisfy the conjugacy constraints. Since the symbols of mG_F are elements of GF(2^r), the mapping in this case requires actual multiplications with the basis elements.

B. Modified Transformed Encoder B

Another vector that satisfies the conjugacy constraints in an intrablock manner is m_F, which is derived by applying Fourier transform to each block of e symbols in m. Since m_F G_F also satisfies the conjugacy constraints, the encoding can be carried out alternatively as

\[ c = m_F G_F V^{-1(n)}. \tag{7} \]

It can be easily proved that the c value from (7) satisfies cH^T = 0, and hence is also a valid codeword.

IV. VLSI ARCHITECTURES FOR MODIFIED ENCODERS

To simplify the notations, this section uses a QCNB-LDPC code over GF(2^2) (p = 2) whose generator matrix consists of circulants of dimension 63 × 63 (r = 6) to explain the proposed encoder architectures. Our architectures can be easily extended to codes with different p and r values, such that p | r. Binary codes belong to these cases with p = 1. Each component is detailed before the overall encoder architectures are summarized. The same composite field arithmetic, Fourier and inverse transform architectures, and generator matrix multiplication architecture are used in both encoders.

A. Composite Field Construction

To construct GF((2^p)^t) from GF(2^p), a degree-t irreducible polynomial over GF(2^p), f(x), is needed. As aforementioned, an element a ∈ GF((2^p)^t) can be represented as a polynomial in x with degree t − 1, a_{t−1} x^{t−1} + ··· + a_1 x + a_0, whose coefficients are elements of GF(2^p). Finite field multiplication is defined as polynomial multiplication modulo f(x), and field addition is done as polynomial addition. GF(2^2) is constructed using f_0(x) = x^2 + x + 1. It is the only degree-2 irreducible polynomial over GF(2). However, for constructing GF((2^2)^3) from GF(2^2), there are quite a few options for the irreducible polynomial. To reduce the complexity of finite field multiplications, an irreducible polynomial with fewer nonzero terms is preferred. For example, f_1(x) = x^3 + φ, where φ = 10 ∈ GF(2^2), is a degree-3 irreducible polynomial over GF(2^2) with the fewest nonzero terms.

The Fourier transform over finite fields is defined using power representation α^{ij}, where α is a primitive element of GF(2^r). If the irreducible polynomial f(x) used for finite field construction is primitive, then α can be considered as a root of f(x) and the power representation is converted to polynomial representation easily by using the property that f(α) = 0. Unfortunately, the irreducible polynomial with the fewest nonzero terms for composite field construction is not necessarily a primitive polynomial. For example, f_1(x) = x^3 + φ is irreducible but not primitive. On the other hand, all degree-3 primitive polynomials over GF(2^2) have four nonzero terms. If a nonprimitive irreducible polynomial, such as f_1(x), is employed for field construction, the primitive element α used in V and V^{-1} cannot be a root of f_1(x), and a mapping needs to be developed to convert those power representations to polynomial representations. Two different representations of finite field elements should be isomorphic. The identity element in polynomial representation should be mapped to the identity element in power representation. x, a root of f_1(x) in composite field representation, should be mapped to a power representation, α^u, that has the same order and is also a root of f_1(x). Once the mapping of α^u is decided, the other elements in the same conjugacy class are
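The composite field construction and the intrablock message mapping described above can be illustrated with a short Python sketch. It is not the paper's implementation. It assumes that GF(2^2) elements are stored as 2-bit integers reduced by f_0, that the element written "10" is the integer 0b10, and that an element a_0 + a_1 x + a_2 x^2 of GF((2^2)^3) is stored as the tuple (a_0, a_1, a_2). All function names and the three message symbols are hypothetical.

```python
PHI = 0b10  # assumed encoding of the element written "10"; GF(2^2) uses f0(y) = y^2 + y + 1


def gf4_mul(a, b):
    """Multiply two GF(2^2) elements given as 2-bit integers."""
    prod = 0
    for bit in range(2):
        if (b >> bit) & 1:
            prod ^= a << bit
    if prod & 0b100:  # reduce y^2 -> y + 1
        prod ^= 0b111
    return prod


def mul(a, b):
    """Multiply two GF((2^2)^3) elements modulo f1(x) = x^3 + PHI."""
    t = [0] * 5
    for i in range(3):
        for j in range(3):
            t[i + j] ^= gf4_mul(a[i], b[j])
    # In characteristic two, x^3 = PHI and x^4 = PHI * x.
    return (t[0] ^ gf4_mul(PHI, t[3]), t[1] ^ gf4_mul(PHI, t[4]), t[2])


def add(a, b):
    """Field addition is coefficient-wise XOR."""
    return tuple(u ^ v for u, v in zip(a, b))


def pow4(a):
    """Raise to the 2^p = 4th power by squaring twice."""
    sq = mul(a, a)
    return mul(sq, sq)


ONE, X = (1, 0, 0), (0, 1, 0)
X2 = mul(X, X)


def order(a):
    """Multiplicative order of a nonzero element."""
    n, acc = 1, a
    while acc != ONE:
        acc = mul(acc, a)
        n += 1
    return n


def trace(a):
    """Trace from GF((2^2)^3) down to GF(2^2): a + a^4 + a^16."""
    a4 = pow4(a)
    return add(add(a, a4), pow4(a4))


def coset(i, e=63, q=4):
    """Cyclotomic coset {i, (q i)_e, (q^2 i)_e, ...} in generation order."""
    out, j = [], i
    while j not in out:
        out.append(j)
        j = (q * j) % e
    return out


# Mapping (5) for the coset of i = 3 with basis {1, x, x^2}: the three
# GF(2^2) message symbols simply become the coefficients (pure wiring).
m3, m12, m48 = 0b01, 0b10, 0b11  # hypothetical message symbols
mbar3 = (m3, m12, m48)
mbar12 = pow4(mbar3)   # conjugate symbol at index (4 * 3) mod 63
mbar48 = pow4(mbar12)  # conjugate symbol at index (16 * 3) mod 63
print(order(X), trace(ONE), trace(X), trace(X2), coset(3), mbar12, mbar48)
```

Under these assumptions, x has multiplicative order 9 rather than 63, which is why f_1(x) is irreducible but not primitive, and the inverse mapping only reads back the three coefficients of the leading symbol.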
to be inverse Fourier transformed to derive the codeword. When the inverse Fourier transform architecture in Fig. 4 is employed, those nonleading symbols are needed, and they are recovered in the message expanding block. This block computes the 2^p-th powers of the leading symbols, and can be implemented easily. When t is not small and/or some of the traces of the basis elements are zero, the inverse Fourier transform architecture in Fig. 5 has lower complexity than that in Fig. 4. The message expanding block is not required when the architecture in Fig. 5 is used.

When encoder A is adopted, extra steps are needed after the LDPC decoding, as shown in Fig. 7(b). According to (6), Fourier transform is applied to the decoding result to derive m̄G_F. Systematic symbols of m̄ are directly observed from m̄G_F. Then, m is recovered by inverse mapping.

F. Modified Transformed Encoder Architecture B

The first ke columns of G_F are also an identity matrix. From (7), the first k blocks of e symbols in the codeword generated from encoder B are

\[ m_F I V^{-1(k)} = m V^{(k)} I V^{-1(k)} = m. \]

As a result, the Fourier transform and inverse for computing the systematic part of the codeword are canceled out in encoder B, and the message symbols directly become the systematic part. Accordingly, encoder B is implemented by the architecture shown in Fig. 8. The inverse Fourier transform is only applied to the parity symbols. Moreover, no extra step is needed after the LDPC decoding when encoder B is employed.

Encoder B is implemented by the same transformed generator matrix multiplication, Fourier transform, and inverse architectures as in encoder A. It does not need the message mapping, which is implemented as hard-wiring. Hence, the area requirements of the two proposed encoding schemes are very similar. Nevertheless, in encoder B, the message symbols do not go through any transform before they become the systematic part of the codeword. Therefore, encoder B needs to activate the inverse Fourier transform for much fewer blocks of symbols, as summarized in Table I. Accordingly, its overall power consumption would be lower than that of encoding approach A, especially for high-rate codes. However, if lower complexity and power consumption are preferred in the encoder itself, such as in the case that the encoder is located in a power-constrained device and the decoder is not, approach A is a better choice, since the Fourier transform is located in the decoder, as shown in Fig. 7.

TABLE I
NUMBERS OF FOURIER TRANSFORM AND INVERSE NEEDED BY THE MODIFIED TRANSFORMED ENCODING/DECODING SYSTEMS

TABLE II
HARDWARE COMPLEXITIES OF 8-PARALLEL MODIFIED TRANSFORMED ENCODERS FROM ARCHITECTURAL ANALYSIS FOR (2016, 1764) QCNB-LDPC CODE OVER GF(2^2) WITH r = 6, p = 2, AND n − k = 4

V. COMPLEXITY ANALYSIS AND COMPARISONS

This section analyzes the complexities of the proposed encoders for two example codes. One is a (2016, 1764) QCNB-LDPC code over GF(2^2) whose G matrix consists of 28 × 32 blocks of 63 × 63 circulants. The other is an (8128, 6985) binary QC-LDPC code whose G matrix consists of 55 × 64 circulants of dimension 127 × 127. For both codes, the proposed encoder architectures lead to substantial complexity reduction.

For the (2016, 1764) NB code, the Fourier transform is defined over GF(2^6), and composite field GF((2^2)^3) is adopted. f_0(x) = x^2 + x + 1 is used to construct GF(2^2) from GF(2), and GF((2^2)^3) is constructed from GF(2^2) by using f_1(x) = x^3 + φ, where φ = 10 ∈ GF(2^2). If x is a root of f_1(x), then Tr(x^2) = Tr(x) = 0 and Tr(1) = 1. As a result, the inverse Fourier transform architecture in Fig. 5 has lower complexity. The trace blocks in this architecture are eliminated, and each of the constant multipliers, except those in the first column, computes only the least significant coefficient in the product. Additions over finite fields using basis representations are bitwise XOR operations. Assume that a two-input NAND gate takes 1/2 the area of an XOR. A GF(2^2) multiplier takes five XOR gates to implement and has three logic gates in the critical path. A GF((2^2)^3) multiplier requires around 54 XOR gates, and has six logic gates in the critical path. Constant multipliers are much simpler than general multipliers, since one of the operands is fixed. For example, a constant multiplier by “10” or “11” over GF(2^2) requires one XOR gate, and multiplying an element in GF(2^2) by α^3 ∈ GF((2^2)^3) as needed in the inverse Fourier transform architecture is implementable by three XOR gates.

Assume that l = 8 input message symbols are available at a time. To match the throughput of the input symbols, 8-parallel processing is adopted in the inverse Fourier transform, and the parallel processing factor for the generator matrix multiplication is set to 3. Since G_F has 32 − 28 = 4 parity block columns, 3 × 4 = 12 general multipliers over GF((2^2)^3)
are needed in the matrix multiplication block. e/(2^p − 1) = 21-parallel processing is employed in the Fourier transform architecture. Although it runs at a speed that is higher than the other components in the encoder, 8-parallel Fourier transform design is more costly as explained in Section IV. Table II shows the logic gate and memory requirements of the building blocks in the proposed encoders from architectural analysis. Encoder A consists of the first three components in Table II. However, for the purpose of fair comparison, the complexity of the Fourier transform needed in the decoder when encoder A is used should also be included. Although encoder B does not require message mapping, it needs registers to buffer a block of e message symbols before they are sent to the Fourier transform. Hence, the total area requirements of the two encoding schemes are very similar. Pipelining is applied before and after the generator matrix multiplication in both proposed encoders. It only brings 2 × ⌈63/8⌉ = 16 clock cycles of extra initial latency. The encoding of a message vector still takes 32 × ⌈63/8⌉ = 256 clock cycles.

The proposed encoders are compared with an l = 8-parallel traditional encoder that computes c = mG using the architecture from [5] in Table III. Since the original generator matrix G is dense in the parity columns and the nonzero entries appear at irregular locations, (n − k)l multipliers over GF(2^2) are employed to take care of the multiplications of all parity columns in l rows simultaneously. Similarly, the l products computed for each column are accumulated to intermediate results stored in registers. Assuming that each register takes three times the area of an XOR gate and storing a bit in memory needs about the same area as an XOR, the proposed encoders reduce the area requirement by 19%. Since the critical path and the number of clock cycles needed for encoding each message vector are the same in the original and proposed encoders, their achievable throughput is the same.

To further validate the advantages of our design, both the proposed and traditional encoders are synthesized using TSMC 28-nm process under timing constraint of 2.5 ns. The results are shown in Table IV. Both encoders achieve 6.3-Gb/s throughput. The equivalent gate counts are derived by dividing the total areas from the synthesis reports by that of an XOR gate. The ROMs storing the matrix entries are actually implemented by combinational logic. The synthesis tool is able to optimize and combine a lot of cells. As a result, the gate counts in Table IV are much smaller than those in Table III. Nevertheless, architectural analysis is a good tool to compare the relative complexities of different designs, and the synthesis results support similar percentage of area saving achievable by our proposed designs as analyzed from architectures.

TABLE IV
SYNTHESIS RESULTS OF EIGHT-PARALLEL ENCODERS FOR (2016, 1764) QCNB-LDPC CODE OVER GF(2^2) USING TSMC 28-nm PROCESS

The encoding algorithms making use of finite Fourier transform in [6] and [7] are given without any hardware architecture. Implementing the Fourier transform and inverse using standard architectures leads to large area and the message mapping may have significant overhead if the bases are not chosen properly. As mentioned, the two encoding algorithms in [7] are the binary cases of the proposed algorithms. In the design of [6], buffering the input messages for interblock mapping alone needs 5796 registers for the example code, and the matrix permutations make the storage and access of intermediate results very difficult. Hence, the comparison with this encoder is not further pursued.

When l increases, the complexity of the over-qualified Fourier transform engine becomes relatively smaller in the overall modified transformed encoders. Hence, the proposed designs would achieve more significant saving over the original encoder for larger l. The number of conjugacy classes in GF((2^p)^t) with respect to GF(2^p) is around 2^{pt}/t. As t increases, the G_F multiplication needs even fewer multipliers compared with the multiplication by G. The relative complexities of GF((2^p)^t) multipliers needed in the G_F multiplication and GF(2^p) multipliers required in multiplying G depend on many factors, such as the irreducible polynomials used for field constructions, and whether t is further decomposable. Nevertheless, even if p = 1 and a GF(2) multiplier is just a single AND gate in the original encoder, the proposed modified transformed encoders can still achieve substantial gate count reduction when r = pt is not small.

Now, consider an (8128, 6985) binary QC-LDPC code whose generator matrix has nine block parity columns of 127 × 127 circulants. Assume that l = 16 bits are input to the encoder in each clock cycle. The 16-parallel processing is also adopted in the inverse Fourier transform. Since there are 19 conjugacy classes in GF(2^7) with respect to GF(2), only 3 multipliers over GF(2^7) are needed for each block of parity columns in the transformed generator matrix multiplication. A GF(2^7) multiplier needs around 85 XOR gates to implement. Since
TABLE VI R EFERENCES
C OMPARISON OF 16-PARALLEL E NCODERS FOR (8128, 6985) B INARY
[1] X. Chen and C.-L. Wang, “High-throughput efficient non-binary LDPC
LDPC C ODE F ROM A RCHITECTURAL A NALYSIS
decoder based on the simplified min-sum algorithm,” IEEE Trans.
Circuits Syst. I, Reg. Papers, vol. 59, no. 11, pp. 2784–2794, Nov. 2012.
[2] F. Cai and X. Zhang, “Relaxed min-max decoder architectures for
nonbinary low-density parity-check codes,” IEEE Trans. Very Large
Scale Integr. (VLSI) Syst., vol. 21, no. 11, pp. 2010–2023, Nov. 2013.
[3] X. Zhang, “Low-complexity min-max non-binary LDPC decoders,”
J. Commun., vol. 10, no. 11, pp. 836–842, Nov. 2015.
[4] B. Zhou, J. Kang, S. Song, S. Lin, K. Abdel-Ghaffar, and M. Xu,
“Construction of non-binary quasi-cyclic LDPC codes by arrays
and array dispersions,” IEEE Trans. Commun., vol. 57, no. 6,
pp. 1652–1662, Jun. 2009.
[5] Z. Li, L. Chen, L. Zeng, S. Lin, and W. H. Fong, “Efficient encoding of quasi-cyclic low-density parity-check codes,” IEEE Trans. Commun., vol. 54, no. 1, pp. 71–81, Jan. 2006.
[6] Q. Huang, L. Tang, Z. Wang, Z. Xiong, and S. He. (Jan. 2013). “A low-complexity encoding of quasi-cyclic codes based on Galois Fourier transform.” [Online]. Available: https://arxiv.org/abs/1301.3220
[7] Q. Huang, L. Tang, S. He, Z. Xiong, and Z. Wang, “Low-complexity encoding of quasi-cyclic codes based on Galois Fourier transform,” IEEE Trans. Commun., vol. 62, no. 6, pp. 1757–1767, Jun. 2014.
[8] X. Zhang and Y. Tai, “Encoder with transform architecture for LDPC codes over subfields using message mapping,” U.S. Patent 2015 0 381 204, Dec. 31, 2015.
[9] X. Zhang and Y. Tai, “Encoder for quasi-cyclic low-density parity-check codes over subfields using Fourier transform,” U.S. Patent 2015 0 381 205, Aug. 30, 2016.
[10] X. Zhang and Y. Tai, “Low complexity partial parallel architectures for Fourier transform and inverse Fourier transform over subfields of a finite field,” U.S. Patent 2015 0 301 985, Apr. 22, 2014.
[11] J. L. Massey and S. Serconek, “A Fourier transform approach to the linear complexity of nonlinearly filtered sequences,” in Advances in Cryptology (Lecture Notes in Computer Science), vol. 839. Santa Barbara, CA, USA: Springer-Verlag, 1994, pp. 332–340.
[12] K. Lee, H.-G. Kang, J.-I. Park, and H. Lee, “A high-speed low-complexity concatenated BCH decoder architecture for 100 Gb/s optical communications,” J. Signal Process. Syst., vol. 66, no. 1, pp. 43–55, Jan. 2012.

e/(2^p − 1) = e for binary codes, the partial-parallel Fourier transform architecture in Fig. 6 becomes a fully parallel design. The complexities of the proposed encoders for this binary LDPC code are summarized in Table V, and compared with that of the conventional encoder in Table VI. Although GF(2) multipliers are two-input AND gates, the conventional encoder requires a large number of them. Hence, the multiplication with the original generator matrix has much higher complexity. Despite the overheads of the buffers, Fourier transform, and inverse, the proposed encoders require 84% of the area of the conventional encoder. The same number of clock cycles is needed to generate a codeword. A GF(2^7) multiplier has eight XOR gates in the data path. Although the proposed encoders have longer critical paths than the conventional binary encoder, practical systems prefer using a single clock signal, and usually, there are other components in the system with a longer critical path. In this case, replacing the conventional encoder by the proposed encoders for binary codes may not incur any throughput penalty for the overall system.
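To make the remark about GF(2) multipliers concrete, the following is a minimal Python sketch of the conventional approach for a single binary circulant block: each message bit is ANDed with the current row of the circulant and XORed into an accumulator, after which the row is cyclically shifted. The function names and the operation counter are hypothetical and introduced only for illustration; the sketch assumes each row of the circulant is the previous row shifted right by one position, and it is not the hardware architecture evaluated in the tables.

```python
from typing import List, Tuple


def shift_right(row: List[int]) -> List[int]:
    """Cyclically shift a circulant row one position to the right."""
    return row[-1:] + row[:-1]


def multiply_by_circulant(message: List[int],
                          first_row: List[int]) -> Tuple[List[int], int]:
    """Multiply a length-e binary message segment by an e x e binary circulant.

    The circulant is described by its first row only (assumption: row i is
    the first row shifted right by i positions). Returns the length-e result
    and the number of two-input AND operations performed.
    """
    e = len(first_row)
    if len(message) != e:
        raise ValueError("message segment and circulant size must match")
    parity = [0] * e
    row = list(first_row)
    and_ops = 0
    for bit in message:
        # GF(2) multiplication is AND; GF(2) addition is XOR.
        parity = [p ^ (bit & g) for p, g in zip(parity, row)]
        and_ops += e
        row = shift_right(row)
    return parity, and_ops


if __name__ == "__main__":
    parity, and_ops = multiply_by_circulant([1, 0, 1], [1, 1, 0])
    print(parity, and_ops)  # [0, 1, 1] 9
```

In this sketch a single e × e block costs e AND operations per message bit, so e² per message segment. Each AND is trivial on its own, but a generator matrix contains many such blocks, so the total is large.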
The binary encoder example described earlier demonstrated that our proposed designs also achieve substantial complexity reduction for codes of longer lengths and/or lower rates, which translate to larger (n − k)e. The achievable complexity reduction would be more significant for codes with larger n − k. This is because the complexity of the entire logic part in the traditional encoder is proportional to n − k. On the other hand, the complexities of the Fourier transform and inverse, which account for the majority of the logic part in the transformed encoders, do not change with n − k.

VI. CONCLUSION

This paper presented two modified transformed encoders for QCNB-LDPC codes over subfields. The proposed designs can also be applied to binary codes, which are a special case of codes over subfields. Compared with the previous Fourier-transformed encoder, the proposed designs have a much simpler data flow and avoid large buffers. In addition, novel architectures were developed to implement the Fourier transform and inverse by exploiting composite field arithmetic and conjugacy constraints. With these optimizations, the proposed designs have a substantially smaller area requirement than the traditional encoder. The memory used for storing the generator matrix occupies a large part of the encoder. Future work will be directed to reducing this storage requirement.

ACKNOWLEDGMENT

The authors would like to thank Xiaoheng Chen for providing the synthesis results.

Xinmiao Zhang (S’04–M’05–SM’10) received the Ph.D. degree from the University of Minnesota, Minneapolis, MN, USA, in 2005.
She joined SanDisk (Western Digital) Corporation, Milpitas, CA, USA, in 2013. She has been a tenured Associate Professor with Case Western Reserve University, Cleveland, OH, USA. She has authored the book VLSI Architectures for Modern Error-Correcting Codes (CRC Press, 2015), and co-edited the book Wireless Security and Cryptography: Specifications and Implementations (CRC Press, 2007). Her current research interests include VLSI architecture design for error-correcting coding, signal processing, and cryptography.
Dr. Zhang is a member of the CASCOM and VSA technical committees, and DISPS Technical Committee Advisory Board of the IEEE. She received the National Science Foundation CAREER Award in 2009 and the Best Paper Award at the ACM Great Lakes Symposium on VLSI in 2004. She served on the technical program and reviewer committees of many conferences. She has been an Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I since 2010.

Ying Tai (S’05–M’06–SM’12) received the Ph.D. degree from the University of California at Davis, Davis, CA, USA, in 2006.
He held the senior technical and management positions with SanDisk Corporation, Milpitas, CA, USA, Cadence Design Systems Inc., San Jose, CA, USA, Boeing, Chicago, IL, USA, and Ikanos Communications Inc., Fremont, CA, USA. His current research interests include coding theory, signal processing, and their applications to data communications and storage systems.
Dr. Tai was a recipient of the IEEE Communications Society Stephen O. Rice Prize in the field of communications theory in 2007.