
Low-Complexity Transformed Encoder Architectures For Quasi-Cyclic Nonbinary LDPC Codes Over Subfields

This document discusses two low-complexity encoder architectures for quasi-cyclic nonbinary LDPC codes over subfields. It introduces two modified transformed encoders that simplify the data flow and eliminate large buffers compared to prior approaches. The advantages are shown using examples, and synthesis results find area reductions of up to 22% compared to traditional encoders.

This article has been accepted for inclusion in a future issue of this journal.

Content is final as presented, with the exception of pagination.

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS 1

Low-Complexity Transformed Encoder Architectures for Quasi-Cyclic Nonbinary LDPC Codes Over Subfields
Xinmiao Zhang, Senior Member, IEEE, and Ying Tai, Senior Member, IEEE

Manuscript received July 2, 2016; revised September 29, 2016; accepted November 12, 2016.
The authors are with Western Digital Corporation, San Jose, CA 95138 USA.
Digital Object Identifier 10.1109/TVLSI.2016.2630055
1063-8210 © 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Abstract: Quasi-cyclic low-density parity-check (QC-LDPC) codes are adopted in many digital communication and storage systems. The encoding of these codes is traditionally done by multiplying the message vector with a generator matrix consisting of dense circulant submatrices. To reduce the encoder complexity, this paper introduces two schemes making use of finite Fourier transform. We focus on QC-LDPC codes whose circulant submatrices are of dimension $(2^r - 1) \times (2^r - 1)$ and the entries are elements of GF($2^p$), where p divides r, and hence, GF($2^p$) is a subfield of GF($2^r$). These cover a broad range of codes, and binary LDPC codes are a special case. Making use of conjugacy constraints, low-complexity architectures are developed for finite Fourier and inverse transforms over subfields in this paper. In addition, composite field arithmetic is exploited to eliminate the computations associated with message mapping and reduce the complexity of Fourier transform. For a (2016, 1074) nonbinary QC-LDPC code whose generator matrix consists of circulants of dimension 63 × 63 with GF($2^2$) entries, the proposed encoders achieve 22% area reduction compared with the conventional encoders without sacrificing the throughput.

Index Terms: Encoder, finite field, Fourier transform, low-density parity-check (LDPC) codes, nonbinary (NB), quasi-cyclic (QC), VLSI architecture.

I. INTRODUCTION

Binary low-density parity-check (LDPC) codes are broadly adopted in digital communication and storage systems. Compared with binary codes, nonbinary LDPC (NB-LDPC) codes, whose entries in the parity check and generator matrices are GF($2^p$) (p > 1) elements, can achieve higher coding gain when the code length is moderate and have better burst error-correction performance. NB-LDPC codes are also finding their way into practical systems due to recent work on decoder complexity reduction, such as those in [1]–[3].

Quasi-cyclic LDPC (QC-LDPC) codes enable efficient partial-parallel processing because of the regularity in their parity-check matrices H. The H matrix of a binary QC-LDPC code consists of cyclic permutation matrices (CPMs), and that of a QCNB-LDPC code can be designed by replacing each "1" in the CPMs with elements of GF($2^p$) (p > 1) [4]. The dimension of the CPMs is usually moderate or large in order to reach a reasonably long codeword length. On the other hand, p needs to be small to keep the decoder complexity low. This paper focuses on the cases that the dimension of the CPMs is $(2^r - 1) \times (2^r - 1)$ and p divides r. Accordingly, GF($2^p$) is a subfield of GF($2^r$). These cases are quite versatile and binary codes are a special case with p = 1.

LDPC code design usually starts with the H matrix. If H has a lower triangular structure, then the encoding can be done by using H directly. However, such an H matrix often has at least a few columns with low weight. This makes the corresponding bits more likely to stay at wrong values, especially in bit-flipping type of decoders, and hence leads to an earlier error floor. A more general encoding method that applies to codes with any H is to derive a systematic generator matrix G from H first, and then multiply the message vector with G [5]. The computed G consists of blocks of circulant matrices, and the circulants corresponding to the parity symbols are usually dense. This means that a large number of multipliers are required for encoding.

A transformed approach was proposed in [6] to reduce the encoder complexity of binary QC-LDPC codes. Applying finite Fourier transform, each dense circulant becomes a diagonal matrix, and hence, the number of multipliers needed for the transformed generator matrix multiplication is greatly reduced at the cost of Fourier transform and some message mapping. Later, two improved encoding methods making use of finite Fourier transform were described in [7]. They are the binary cases of the encoders over subfields developed independently in [8] and [9]. No hardware implementation architecture was provided in [7]. If implemented directly according to the formulas, the overheads brought by the Fourier transform, message mapping, and their inverse may offset the savings achieved by the simplified generator matrix multiplication.

This paper first presents the two modified transformed encoders for QCNB-LDPC codes over subfields described in [8] and [9] with more details. Unlike the approach in [6], the modified encoders do not require row or column permutation on G. The first encoder maps the messages in an intrablock instead of interblock way, and the second encoder does not need message mapping at all. As a result, the data flow is simplified and large buffers are eliminated. By making use of the property that the inputs (outputs) are subfield elements and outputs (inputs) satisfy the conjugacy constraints, efficient partial-parallel architectures are developed for the Fourier (inverse) transform over subfields. The simplified architectures are made possible by exploiting composite field arithmetic, which also contributes to the elimination of the computational complexity associated with the message mapping. These architectures have been described in a patent application [10] by the authors.
The advantages of the proposed architectures are shown by using a (2016, 1764) QCNB-LDPC code over GF(4) in this paper. From synthesis reports using a 28-nm process, the proposed modified transformed encoders achieve 22% area reduction compared with the traditional encoder that directly multiplies the message vector with the generator matrix, while maintaining the same throughput. In addition, the second proposed encoding approach has lower overall power consumption than the first, since the inverse Fourier transform is activated fewer times. The proposed schemes are also applied to an (8128, 6985) binary QC-LDPC code. Architectural analysis shows that the area requirement is reduced by 16% using the proposed schemes.

This paper is organized as follows. Section II introduces the traditional encoder for QCNB-LDPC codes and the prior Fourier-transformed encoder in [6]. The proposed modified transformed encoding methods are presented in Section III. Section IV develops efficient VLSI architectures for the proposed encoders. Hardware complexity analysis is provided in Section V, and the conclusions are drawn in Section VI.

II. ENCODING OF QCNB-LDPC CODES

QCNB-LDPC codes over GF($2^r$) can be constructed by the methods in [4], which give H matrices consisting of submatrices of dimension $(2^r - 1) \times (2^r - 1)$. To keep the decoder complexity low, elements of low-order finite field GF($2^p$) are used as the entries of H. This paper focuses on the cases that p divides r, and hence, GF($2^p$) is a subfield of GF($2^r$). For QC-LDPC codes, a systematic generator matrix G can be computed from H [5]. To simplify the notations, let $e = 2^r - 1$. If H is full rank, then the computed G consists of circulant matrices. Assume that the dimension of H is $(n-k)e \times ne$. The computed G is expressed as

$$G = \begin{bmatrix} I & 0 & \cdots & 0 & G_{0,0} & \cdots & G_{0,n-k-1} \\ 0 & I & \cdots & 0 & G_{1,0} & \cdots & G_{1,n-k-1} \\ \vdots & \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & I & G_{k-1,0} & \cdots & G_{k-1,n-k-1} \end{bmatrix}$$

where I is an e × e identity matrix, and each $G_{i,j}$ ($0 \le i < k$, $0 \le j < n-k$) is an e × e circulant. If H is not full rank, G has additional irregular rows. These rows can be eliminated from G through padding zeros to the message symbols.

The encoding computes c = mG, where $m = [m_0, m_1, \ldots, m_{ke-1}]$ is the message vector and c is the codeword. In a circulant matrix, each row is the previous row cyclically shifted by one position. Due to this property, the encoder of a QC-LDPC code is implementable by shift-register architectures [5]. However, unlike H, $G_{i,j}$ is rather dense, and the nonzero entries in each row of $G_{i,j}$ appear at irregular locations. Therefore, one multiplier is adopted for each parity column in the multiplication of m by G. As a result, the encoder needs a large number of multipliers and hence has high silicon area requirement.

A Fourier-transformed encoding method was proposed in [6] for QC-LDPC codes. Assume that α is a primitive element of GF($2^r$), and $V = [\alpha^{-ij}]$ ($0 \le i, j < e$) is an e × e matrix over GF($2^r$) whose entry in the ith row and jth column is $\alpha^{-ij}$. The Fourier transform of an e-tuple vector over GF($2^r$), $w = [w_0, w_1, \ldots, w_{e-1}]$, denoted by $\mathcal{F}(w)$, is computed as

$$\mathcal{F}(w) = wV = w \begin{bmatrix} 1 & 1 & 1 & 1 & \cdots \\ 1 & \alpha^{-1} & \alpha^{-2} & \alpha^{-3} & \cdots \\ 1 & \alpha^{-2} & \alpha^{-4} & \alpha^{-6} & \cdots \\ \vdots & \vdots & \vdots & \vdots & \ddots \\ 1 & \alpha^{-(e-1)} & \alpha^{-(e-2)} & \alpha^{-(e-3)} & \cdots \end{bmatrix}.$$

The inverse of V is $V^{-1} = [\alpha^{ij}]$, and the inverse Fourier transform of a vector w is done as

$$\mathcal{F}^{-1}(w) = wV^{-1} = w \begin{bmatrix} 1 & 1 & 1 & 1 & \cdots \\ 1 & \alpha & \alpha^{2} & \alpha^{3} & \cdots \\ 1 & \alpha^{2} & \alpha^{4} & \alpha^{6} & \cdots \\ \vdots & \vdots & \vdots & \vdots & \ddots \\ 1 & \alpha^{e-1} & \alpha^{e-2} & \alpha^{e-3} & \cdots \end{bmatrix}.$$

It has been shown in [6] that if W is a circulant matrix and the first row of W is w, then

$$W^{\mathcal{F}} = V^{-1} W V = \mathrm{diag}(d_0, d_1, \ldots, d_{e-1}),$$

where $[d_0, d_1, \ldots, d_{e-1}]$ equals $\mathcal{F}(w)$. The diagonal matrix $W^{\mathcal{F}}$ is called the Fourier transform of W. Let $V^{-1(k)}$ ($V^{(n)}$) be a k × k (n × n) diagonal array of $V^{-1}$ (V). The Fourier transform of G is defined as $G^{\mathcal{F}} = V^{-1(k)} G V^{(n)}$. Then, the transformed encoding is done as

$$c = c^{\mathcal{F}} V^{-1(n)} = (m G^{\mathcal{F}}) V^{-1(n)}. \tag{1}$$

The c value computed using (1) satisfies $cH^T = 0$ and hence is a codeword [6]. Fig. 1 shows the format of the transformed generator matrix, $G^{\mathcal{F}}$. The first ke columns are still an identity matrix, and the last (n − k)e columns for the parities are diagonal matrices of dimension e × e. As a result, multiplying a message symbol with $G^{\mathcal{F}}$ only requires n − k multipliers.

Fig. 1. Format of transformed generator matrix for a QC-LDPC code.

Even if m is over a subfield, the c computed from (1) is over GF($2^r$), unless conjugacy constraints are satisfied in each block of e symbols in the transformed codeword $c^{\mathcal{F}} = mG^{\mathcal{F}}$. Reference [6] focuses on the case that m is binary. An e-tuple vector w is binary if and only if the symbols in its Fourier transform $d = \mathcal{F}(w)$ satisfy the conjugacy constraints that

$$d_{(2i)_e} = d_i^2.$$
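The properties stated above can be checked numerically on a small field. The following is a minimal sketch, not the authors' implementation: it assumes r = 4 (so e = 15) and builds GF($2^4$) from the primitive polynomial $x^4 + x + 1$, both chosen only to keep the example small. The helper names (`gf_mul`, `alpha_pow`, `fourier`, `mat_mul`) and the test vector `w` are hypothetical. It verifies that $V^{-1}WV$ is diagonal with $\mathcal{F}(w)$ on the diagonal, that a binary w gives $d_{(2i) \bmod e} = d_i^2$, and that the inverse transform recovers w.

```python
# Assumption: GF(2^4) built from the primitive polynomial x^4 + x + 1.
R = 4
E = (1 << R) - 1               # e = 2^r - 1 = 15
PRIM_POLY = 0b10011

# EXP[i] holds alpha^i as an integer bit vector; LOG is the inverse table.
EXP = [0] * (2 * E)
LOG = [0] * (E + 1)
value = 1
for i in range(E):
    EXP[i] = value
    EXP[i + E] = value          # duplicate so LOG sums need no reduction
    LOG[value] = i
    value <<= 1
    if value & (1 << R):
        value ^= PRIM_POLY


def gf_mul(a, b):
    """Multiply two GF(2^R) elements using the log tables."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def alpha_pow(k):
    """alpha^k for any integer k (alpha has multiplicative order E)."""
    return EXP[k % E]


def fourier(w, sign=-1):
    """Return wV for sign=-1 (V = [alpha^(-ij)]) or wV^-1 for sign=+1."""
    result = []
    for j in range(E):
        acc = 0
        for i, w_i in enumerate(w):
            acc ^= gf_mul(w_i, alpha_pow(sign * i * j))
        result.append(acc)
    return result


def mat_mul(a, b):
    """Square matrix product over GF(2^R); field addition is XOR."""
    n = len(a)
    out = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = 0
            for k in range(n):
                acc ^= gf_mul(a[i][k], b[k][j])
            out[i][j] = acc
    return out


V = [[alpha_pow(-i * j) for j in range(E)] for i in range(E)]
V_INV = [[alpha_pow(i * j) for j in range(E)] for i in range(E)]

# Arbitrary binary first row w; row i of the circulant W is w shifted by i.
w = [1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
W = [[w[(j - i) % E] for j in range(E)] for i in range(E)]

d = fourier(w)
W_F = mat_mul(mat_mul(V_INV, W), V)

is_diagonal = all(W_F[i][j] == (d[i] if i == j else 0)
                  for i in range(E) for j in range(E))
conjugacy_ok = all(d[(2 * i) % E] == gf_mul(d[i], d[i]) for i in range(E))
round_trip_ok = fourier(d, sign=+1) == w
print(is_diagonal, conjugacy_ok, round_trip_ok)
```

No scaling by 1/e appears in the inverse transform because e is odd and the field has characteristic two, so e acts as 1.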
ZHANG AND TAI: LOW-COMPLEXITY TRANSFORMED ENCODER ARCHITECTURES FOR QCNB-LDPC CODES OVER SUBFIELDS 3

Here, $(2i)_e$ denotes 2i mod e. To make the encoder output binary for binary messages, it was proposed in [6] to permute $G^{\mathcal{F}}$ and carry out mappings on m. The rows of $G^{\mathcal{F}}$ are permuted according to $\pi_{row} = [0, e, 2e, \ldots, (k-1)e, 1, e+1, 2e+1, \ldots, (k-1)e+1, \ldots, e-1, 2e-1, 3e-1, \ldots, ke-1]$. It means that row e of $G^{\mathcal{F}}$ becomes the second row after permutation, row 2e of $G^{\mathcal{F}}$ becomes the third row after permutation, and so on. The columns are also permuted in a similar way. The permuted and transformed generator matrix, $G^{\mathcal{F},\pi}$, only has e nonzero matrices of dimension k × n in the diagonal. In addition, an e-tuple vector formed by the entries in the same row and column inside each of these e submatrices satisfies the conjugacy constraints. This is referred to as satisfying the conjugacy constraints in an interblock way. m is also divided into e blocks of k messages and mapped to $\hat{m}$ to satisfy the conjugacy constraints in an interblock way. The indices of the elements satisfying the conjugacy constraints form cyclotomic cosets $\{i, (2i)_e, (2^2 i)_e, \ldots, (2^{\eta_i - 1} i)_e\}$. The cardinality of each coset, $\eta_i$, must be a divisor of r. Let $\{\beta_{i,0}, \beta_{i,1}, \ldots, \beta_{i,\eta_i-1}\}$ be a basis of the subfield GF($2^{\eta_i}$). The message mapping is done for each cyclotomic coset and $\mu = 0, 1, \ldots, \eta_i - 1$, $j = 0, 1, \ldots, k-1$ as

$$\hat{m}_{(2^{\mu} i)_e k + j} = \left( \sum_{l=0}^{\eta_i - 1} \beta_{i,l}\, m_{(2^{l} i)_e k + j} \right)^{2^{\mu}}. \tag{2}$$

As a result, $\hat{m}G^{\mathcal{F},\pi}$ satisfies the conjugacy constraints in an interblock way. If reversely permuted, it would become a vector that satisfies the constraints in an intrablock manner, i.e., each block of e consecutive symbols satisfies the constraints. Accordingly, binary codewords are derived as follows [6]:

$$c = \hat{c}^{\mathcal{F},\pi^{-1}} V^{-1(n)} = (\hat{m} G^{\mathcal{F},\pi})^{\pi^{-1}} V^{-1(n)}. \tag{3}$$

The c value computed from (3) is different from mG and that in (1). However, all of them are valid codewords.

Although computing $\hat{m}G^{\mathcal{F},\pi}$ requires far fewer multipliers than calculating mG, the transformed encoding according to (3) has drawbacks. The indices in a cyclotomic coset can be quite far apart. This requires large buffers to carry out the interblock mapping in (2). The permutations on $G^{\mathcal{F}}$ and $\hat{c}^{\mathcal{F}}$ complicate their storage and access. Besides the message mapping in (2) and inverse Fourier transform in (3), inverse mapping and Fourier transform are needed in the decoder. The overhead of these computations may offset the saving brought by the less-complicated multiplication of $\hat{m}$ and $G^{\mathcal{F},\pi}$.

The work in [6] is limited to binary codes and provided only theoretical complexity analysis. In Section III, the transformed encoding is first extended to the cases that the message symbols are elements of a subfield. Then, two modified encoding schemes, A and B, are described. The encoder in the transformed domain and that by Galois Fourier transform in [7] are the binary cases of encoders A and B, respectively. Joint algorithmic and architectural optimizations are proposed in the section after to develop efficient hardware implementation architectures for our modified transformed encoders.

III. MODIFIED FINITE FOURIER-TRANSFORMED ENCODERS OVER SUBFIELDS

If the message symbols are elements of GF($2^p$), the codeword should be also over GF($2^p$) to reduce the storage requirement and decoding complexity. The conjugacy constraints can be extended to make the transformed encoding outputs in the same finite field. An e-tuple vector w is over GF($2^p$), a subfield of GF($2^r$), if and only if the symbols in its Fourier transform $d = \mathcal{F}(w)$ satisfy the following conjugacy constraints [11]:

$$d_{(2^p i)_e} = d_i^{2^p}. \tag{4}$$

The corresponding cyclotomic cosets are $\{i, (2^p i)_e, (2^{2p} i)_e, \ldots, (2^{(\eta_i - 1)p} i)_e\}$, and the cardinality, $\eta_i$, of each coset is a divisor of r/p.

In the transformed encoding of [6], $G^{\mathcal{F}}$ is permuted to satisfy the conjugacy constraints in an interblock manner. Then, after the mapped messages are multiplied, the product is reversely permuted to satisfy the conjugacy constraints in an intrablock way. However, the entries in each of the diagonal submatrices of $G^{\mathcal{F}}$ already satisfy the constraints (4) in an intrablock way when the entries of G are over GF($2^p$). If the message vector multiplied to $G^{\mathcal{F}}$ satisfies the constraints in the same way, then no permutation is required on $G^{\mathcal{F}}$ and the product does not need to be reversely permuted. Inspired by this, two modified transformed encoders are proposed next.

A. Modified Transformed Encoder A

Divide m into k blocks of e messages. Each block is mapped to a block of e messages in $\bar{m}$ that satisfy the conjugacy constraints in (4) according to

$$\bar{m}_{ej + (2^{p\mu} i)_e} = \left( \sum_{l=0}^{\eta_i - 1} \beta_{i,l}\, m_{ej + (2^{pl} i)_e} \right)^{2^{p\mu}}, \tag{5}$$

where $\mu = 0, 1, \ldots, \eta_i - 1$, $j = 0, 1, \ldots, k-1$, and $\{\beta_{i,0}, \beta_{i,1}, \ldots, \beta_{i,\eta_i-1}\}$ is a basis of GF($(2^p)^{\eta_i}$). Using $\bar{m}$, the encoding is modified as

$$c = \bar{c}^{\mathcal{F}} V^{-1(n)} = (\bar{m} G^{\mathcal{F}}) V^{-1(n)}. \tag{6}$$

For a given m, the c value computed from (6) is not the same as those from (3) or the original encoding. However

$$cH^T = (\bar{m} G^{\mathcal{F}}) V^{-1(n)} H^T = \bar{m} V^{-1(k)} G V^{(n)} V^{-1(n)} H^T = \bar{m} V^{-1(k)} G H^T = 0$$

and hence, it is also a valid codeword. Since each block of e symbols in $\bar{m}G^{\mathcal{F}}$ satisfies the conjugacy constraints, the encoding using (6) also generates a codeword over GF($2^p$).

In (5), the mapping is done based on the messages in the same block. Assume that the input messages arrive in a partial-parallel way. The extra message buffer needed in the proposed encoding is only for one single block of e messages. On the other hand, the interblock message mapping in (2) may involve messages from blocks quite far apart. For example, when r = 6 and p = 2, one cyclotomic coset is {3, 12, 48},
and hence, message blocks 3–48 need to be buffered for the corresponding message mapping. Moreover, the proposed modified encoding does not need any row or column permutation. This simplifies the access of the memory storing the generator matrix, and no buffer or routing network is needed on $\bar{m}G^{\mathcal{F}}$. As a result, the intrablock message mapping and modified encoding in (6) lead to not only substantial buffer reduction but also much simpler dataflow.

Although [6] and [7] mentioned that a basis is needed for message mapping, they did not specify what type of bases should be used to reduce the implementation complexity. The mapping is done as basis element multiplications followed by additions. However, if the basis is not chosen properly, it may be very difficult to recover m from the mapped messages in the decoder. For cyclotomic cosets of the same cardinality, the same basis is used. Moreover, m consists of elements of GF($2^p$), a subfield of GF($2^r$) from which the symbols of $\bar{m}$ are drawn. Intuitively, the composite field GF($(2^p)^t$) (t = r/p), which is isomorphic to GF($2^r$), can be adopted to reduce the mapping complexity. Let f(x) be a degree-t polynomial that is irreducible over GF($2^p$). An element of GF($(2^p)^t$) can be expressed as $a_0 + a_1 x + \cdots + a_{t-1} x^{t-1}$, where $a_s \in$ GF($2^p$) ($0 \le s < t$) and x is a root of f(x). $\eta_i$ is a divisor of t. When $\eta_i = t$, a good basis to use for the message mapping is $\{\beta_{i,0}, \beta_{i,1}, \ldots, \beta_{i,t-1}\} = \{1, x, \ldots, x^{t-1}\}$. In this case, proper symbols of m are collected to be the $a_s$ coefficients of each $\bar{m}_{ej+i}$ in composite field representation. This does not require any logic computation. Then, $\bar{m}_{ej+(2^{p\mu} i)_e}$ for μ > 0 are computed by iteratively taking the $2^p$th power of $\bar{m}_{ej+i}$, which is implemented by simple logic for finite fields of characteristic two. Similarly, the inverse mapping is done by assigning the $a_s$ coefficients in each $\bar{m}_{ej+i}$ to proper symbols in m. Hence, the inverse mapping does not require any logic gate either. If $\eta_i = 1$, no mapping is needed. When $1 < \eta_i < t$, each β value in the basis should be an element of the subfield GF($(2^p)^{\eta_i}$). Among all the elements of GF($(2^p)^{\eta_i}$), pick $\eta_i$ of them that have nonzero $a_s$ at distinct locations. They form a basis of the subfield, and the mapping using this basis can be done in a similar way as in the case of $\eta_i = t$.

Besides making the message mapping a very easy process, adopting composite fields has other benefits. Composite field multipliers have lower complexity than the multipliers in the original field. They also enable simplifications of the Fourier transform over subfields as will be shown in Section IV.

A variation of the encoder described in this section for binary codes was also presented in [7]. m is directly multiplied with $G^{\mathcal{F}}$. Then the products are mapped using basis $\{\beta_{i,0}, \beta_{i,1}, \ldots, \beta_{i,\eta_i-1}\}$ according to an equation similar to (5) in order to satisfy the conjugacy constraints. Since the symbols of $mG^{\mathcal{F}}$ are elements of GF($2^r$), the mapping in this case requires actual multiplications with the basis elements.

B. Modified Transformed Encoder B

Another vector that satisfies the conjugacy constraints in an intrablock manner is $m^{\mathcal{F}}$, which is derived by applying Fourier transform to each block of e symbols in m. Since $m^{\mathcal{F}} G^{\mathcal{F}}$ also satisfies the conjugacy constraints, the encoding can be carried out alternatively as

$$c = m^{\mathcal{F}} G^{\mathcal{F}} V^{-1(n)}. \tag{7}$$

It can be easily proved that the c value from (7) satisfies $cH^T = 0$, and hence is also a valid codeword.

IV. VLSI ARCHITECTURES FOR MODIFIED ENCODERS

To simplify the notations, this section uses a QCNB-LDPC code over GF($2^2$) (p = 2) whose generator matrix consists of circulants of dimension 63 × 63 (r = 6) to explain the proposed encoder architectures. Our architectures can be easily extended to codes with different p and r values, such that p | r. Binary codes belong to these cases with p = 1. Each component is detailed before the overall encoder architectures are summarized. The same composite field arithmetic, Fourier and inverse transform architectures, and generator matrix multiplication architecture are used in both encoders.

A. Composite Field Construction

To construct GF($(2^p)^t$) from GF($2^p$), a degree-t irreducible polynomial over GF($2^p$), f(x), is needed. As aforementioned, an element $a \in$ GF($(2^p)^t$) can be represented as a polynomial in x with degree t − 1, $a_{t-1} x^{t-1} + \cdots + a_1 x + a_0$, whose coefficients are elements of GF($2^p$). Finite field multiplication is defined as polynomial multiplication modulo f(x), and field addition is done as polynomial addition. GF($2^2$) is constructed using $f_0(x) = x^2 + x + 1$. It is the only degree-2 irreducible polynomial over GF(2). However, for constructing GF($(2^2)^3$) from GF($2^2$), there are quite a few options for the irreducible polynomial. To reduce the complexity of finite field multiplications, an irreducible polynomial with fewer nonzero terms is preferred. For example, $f_1(x) = x^3 + \phi$, where $\phi = 10 \in$ GF($2^2$), is a degree-3 irreducible polynomial over GF($2^2$) with the fewest nonzero terms.

The Fourier transform over finite fields is defined using power representation $\alpha^{ij}$, where α is a primitive element of GF($2^r$). If the irreducible polynomial f(x) used for finite field construction is primitive, then α can be considered as a root of f(x) and the power representation is converted to polynomial representation easily by using the property that f(α) = 0. Unfortunately, the irreducible polynomial with the fewest nonzero terms for composite field construction is not necessarily a primitive polynomial. For example, $f_1(x) = x^3 + \phi$ is irreducible but not primitive. On the other hand, all degree-3 primitive polynomials over GF($2^2$) have four nonzero terms. If a nonprimitive irreducible polynomial, such as $f_1(x)$, is employed for field construction, the primitive element α used in V and $V^{-1}$ cannot be a root of $f_1(x)$, and a mapping needs to be developed to convert those power representations to polynomial representations. Two different representations of finite field elements should be isomorphic. The identity element in polynomial representation should be mapped to the identity element in power representation. x, a root of $f_1(x)$ in composite field representation, should be mapped to a power representation, $\alpha^u$, that has the same order and is also a root of $f_1(x)$. Once the mapping of $\alpha^u$ is decided, the other elements in the same conjugacy class are
derived easily. $x^2$ is mapped to $\alpha^{2u}$ etc. After that, pick any two powers of α that have already been mapped, and their sum and product should be mapped to the sum and product, respectively, of the corresponding elements in polynomial representation. This process is repeated until the mapping of all powers of α is decided. Since the entries of V and $V^{-1}$ are known, such mapping is precomputed and does not incur any hardware overhead.

B. Intrablock Message Mapping

Let $\beta \in$ GF($2^6$). The conjugates of β with respect to GF($2^2$) are $\beta, \beta^4, \beta^{4^2}, \cdots$. Hence, assuming that α is a primitive element of GF($2^6$), the conjugacy classes of GF($2^6$) with respect to GF($2^2$) are

$$\begin{array}{lll}
\{1\} & \{\alpha^{21}\} & \{\alpha^{42}\} \\
\{\alpha, \alpha^{4}, \alpha^{16}\} & \{\alpha^{2}, \alpha^{8}, \alpha^{32}\} & \{\alpha^{3}, \alpha^{12}, \alpha^{48}\} \\
\{\alpha^{5}, \alpha^{20}, \alpha^{17}\} & \{\alpha^{6}, \alpha^{24}, \alpha^{33}\} & \{\alpha^{7}, \alpha^{28}, \alpha^{49}\} \\
\{\alpha^{9}, \alpha^{36}, \alpha^{18}\} & \{\alpha^{10}, \alpha^{40}, \alpha^{34}\} & \{\alpha^{11}, \alpha^{44}, \alpha^{50}\} \\
\{\alpha^{13}, \alpha^{52}, \alpha^{19}\} & \{\alpha^{14}, \alpha^{56}, \alpha^{35}\} & \{\alpha^{15}, \alpha^{60}, \alpha^{51}\} \\
\{\alpha^{22}, \alpha^{25}, \alpha^{37}\} & \{\alpha^{23}, \alpha^{29}, \alpha^{53}\} & \{\alpha^{26}, \alpha^{41}, \alpha^{38}\} \\
\{\alpha^{27}, \alpha^{45}, \alpha^{54}\} & \{\alpha^{30}, \alpha^{57}, \alpha^{39}\} & \{\alpha^{31}, \alpha^{61}, \alpha^{55}\} \\
\{\alpha^{43}, \alpha^{46}, \alpha^{58}\} & \{\alpha^{47}, \alpha^{62}, \alpha^{59}\} &
\end{array}$$

The corresponding cyclotomic cosets are

$$\begin{array}{lll}
\{0\} & \{21\} & \{42\} \\
\{1, 4, 16\} & \{2, 8, 32\} & \{3, 12, 48\} \\
\{5, 20, 17\} & \{6, 24, 33\} & \{7, 28, 49\} \\
\{9, 36, 18\} & \{10, 40, 34\} & \{11, 44, 50\} \\
\{13, 52, 19\} & \{14, 56, 35\} & \{15, 60, 51\} \\
\{22, 25, 37\} & \{23, 29, 53\} & \{26, 41, 38\} \\
\{27, 45, 54\} & \{30, 57, 39\} & \{31, 61, 55\} \\
\{43, 46, 58\} & \{47, 62, 59\} &
\end{array} \tag{8}$$

Assume that l message symbols are input to the encoder at a time. The messages are buffered in a register array that holds e GF($2^p$) symbols. Once the e messages in a block have arrived, the mapping on them is done simultaneously in the proposed encoder A. The cardinality of each cyclotomic coset in (8) is either one or three. If the cardinality is one, no mapping is needed. For those with cardinality three, the same mapping basis, $\{\beta_0, \beta_1, \beta_2\} = \{1, x, x^2\}$, is used. Here, x is a root of $f_1(x)$. In this case, a symbol in $\bar{m}$ whose index is the leading term of a cyclotomic coset is simply derived by collecting the three input message symbols whose indices belong to that cyclotomic coset. For example, $\bar{m}_2 = m_2 + m_8 x + m_{32} x^2$. The architecture for the intrablock message mapping is shown in Fig. 2. In this paper, the symbols whose indices are leading terms of cyclotomic cosets are called the leading symbols. To reduce the complexity, only the leading symbols of $\bar{m}$ need to be derived as will be explained next.

Fig. 2. Message mapping for GF($(2^2)^3$) elements.

C. Transformed Generator Matrix Multiplication Architecture

The transformed generator matrix, $G^{\mathcal{F}}$, is in the format shown in Fig. 1. Each e × e submatrix in the parity columns is a diagonal matrix, and the entries in the diagonal satisfy the conjugacy constraints. Therefore, only the leading symbols in those diagonals need to be stored, and the memory requirement is reduced. In addition, each block of e symbols in the products $\bar{m}G^{\mathcal{F}}$ and $m^{\mathcal{F}} G^{\mathcal{F}}$ also satisfies the conjugacy constraints. Hence, unlike what was suggested in [6] and [7], only the leading symbols in the products are computed in our designs, and the others are recovered if needed by $2^p$th power, whose complexity is lower than that of a general multiplier. As a result, substantially fewer multipliers are needed in the transformed generator matrix multiplication. A similar idea was used in the encoder variation of [7] that computes $mG^{\mathcal{F}}$ to reduce the number of multiplications needed in mapping the products.

Fig. 3 shows a partial-parallel architecture for the transformed generator matrix multiplication. In each clock cycle, $l'$ leading symbols of $\bar{m}$ or $m^{\mathcal{F}}$ are simultaneously multiplied with all the nonzero entries in the corresponding rows of $G^{\mathcal{F}}$. $l'$ is chosen so that the throughput of the matrix multiplication is the same as that of the input message symbols to avoid extra buffers. For example, when l = 8 and e = 63, it takes $\lceil 63/8 \rceil = 8$ clock cycles to get the messages for a block. The matrix multiplication involving the messages in a block needs to be finished in eight clock cycles. In the case of p = 2 and r = 6, there are 23 cyclotomic cosets. Therefore, the multiplications of 23 leading symbols need to be completed in eight clock cycles. Hence, $l' = \lceil 23/8 \rceil = 3$. Since the parity part of $G^{\mathcal{F}}$ consists of n − k block columns of diagonal submatrices, n − k multipliers are needed for each of the $l'$ leading symbols in $\bar{m}$ or $m^{\mathcal{F}}$. The intermediate products for all parity columns are stored in shift registers. In every clock cycle, the output of each multiplier is accumulated to the intermediate result belonging to the same column of $G^{\mathcal{F}}$. The registers are shifted by $l'$ positions each time, so that the connections between the multipliers and registers remain unchanged.

Fig. 3. Partial-parallel transformed generator matrix multiplication architecture.

D. Fourier Transform and Inverse Architectures

From (6) and (7), inverse Fourier transform needs to be applied to blocks of e symbols in the proposed encoders.
Fig. 4. Simplified inverse Fourier transform architecture by making use of subfield elements.

Fig. 5. Simplified inverse Fourier transform architecture by making use of conjugacy constraints.

Take the e symbols in a block as the coefficients of $w(x) = w_0 + w_1 x + \cdots + w_{e-1} x^{e-1}$. The inverse Fourier transform is equivalent to finding the evaluation values of w(x) over $1, \alpha, \ldots, \alpha^{e-1}$. These computations can be implemented by Chien search, whose architecture is available in many publications, such as [12]. An l-parallel Chien search architecture computes l evaluation values at a time, and needs l(e − 1) constant multipliers and l(e − 1) adders over GF($2^r$) for a degree e − 1 polynomial. Although constant multipliers are less complicated than general multipliers, such a large number of constant multipliers and adders over GF($2^r$) would offset the hardware savings achieved by the less complicated transformed generator matrix multiplication.

By exploiting the conjugacy constraints and composite field arithmetic, the Chien search architecture can be substantially simplified to implement the inverse Fourier transform over subfields. The first proposed architecture is shown in Fig. 4. This is modified from the architecture in [12]. Although the inputs of the inverse Fourier transform are elements of GF($2^r$), the outputs are symbols in subfield GF($2^p$). An element in composite field GF($(2^p)^t$) is written as $a_{t-1} x^{t-1} + \cdots + a_1 x + a_0$, where $a_s \in$ GF($2^p$) ($0 \le s < t$). For those elements also belonging to subfield GF($2^p$), only $a_0$ is nonzero, and $a_1, a_2, \ldots, a_{t-1}$ are all zero.

The inverse Fourier transform architecture may be further simplified by making use of the property that the inputs, $w_i$, satisfy the conjugacy constraints. $\eta_i$ is a divisor of t for composite field GF($(2^p)^t$). Actually, the $\eta_i$ values for most cyclotomic cosets equal t as can be seen from (8). From the conjugacy constraints,

$$w_{(2^{p\mu} i)_e}\, \alpha^{j(2^{p\mu} i)} = w_i^{2^{p\mu}}\, \alpha^{j(2^{p\mu} i)} = (w_i \alpha^{ji})^{2^{p\mu}}$$

for $\mu = 1, 2, \ldots, \eta_i - 1$. The trace of $a \in$ GF($q^t$) is defined as $\mathrm{Tr}(a) = \sum_{i=0}^{t-1} a^{q^i}$. Therefore, for those cyclotomic cosets with $\eta_i = t$,

$$\sum_{\mu=0}^{\eta_i - 1} w_{(2^{p\mu} i)_e}\, \alpha^{j(2^{p\mu} i)} = \mathrm{Tr}(w_i \alpha^{ji}).$$

This means that the sum of the evaluation values corresponding to the $w_i$ values in each of those cyclotomic cosets is equal to the trace of the evaluation value corresponding to the leading $w_i$. The trace of an element in GF($(2^p)^t$) is always an element in GF($2^p$). For $a = a_{t-1} x^{t-1} + \cdots + a_1 x + a_0$, $\mathrm{Tr}(a) = a_{t-1} \mathrm{Tr}(x^{t-1}) + \cdots + a_1 \mathrm{Tr}(x) + a_0 \mathrm{Tr}(1)$. $\mathrm{Tr}(x^{t-1}), \ldots, \mathrm{Tr}(x), \mathrm{Tr}(1)$ can be precomputed. Then, the trace of an element in GF($(2^p)^t$) is derived easily by constant multiplications over GF($2^p$). Based on this, another architecture for the inverse Fourier transform over subfields is proposed in Fig. 5. To clarify the indices of the inputs, this architecture is for an example case of GF($(2^2)^3$) whose
The outputs of the gray-colored constant multipliers in Fig. 4 cyclotomic cosets are listed in (8). Instead of having one row
only go through adders, which are implemented as bit- of multipliers for each wi , as in Fig. 4, this architecture only
wise XOR, before the outputs of the inverse Fourier transform has one row of multipliers for each leading wi . The sum of
are derived. Therefore, those multipliers only need to compute the evaluation values corresponding to all the wi values in
the a0 coefficients in the products, and accordingly require each coset with cardinality t is derived by the trace. As a
around 1/t the complexity of pt-bit constant multipliers. As an result, the number of multipliers has been almost reduced by
example, consider the multiplication of b and c ∈ GF((22 )3 ), t times compared with that in Fig. 4. On the other hand, if t 
which is done as a = bc mod f 1 (x). It can be derived that of Tr(x t −1 ), . . . , Tr(x), Tr(1) are nonzero, then t  constant
⎧ multipliers over GF(2 p ) are needed to compute the trace.

⎨ a 2 = b 2 c0 + b 1 c1 + b 0 c2 The traces of the basis elements vary with the irreducible
a 1 = b 2 c2 φ + b 1 c0 + b 0 c1 . polynomial used to construct GF((2 p )t ). Proper irreducible

⎩ polynomials can be chosen to reduce the complexity of the
a 0 = b 2 c1 φ + b 1 c2 φ + b 0 c0
trace computation. For example, if f 1 (x) = x 3 + φ is used
Only the a0 coefficient in the product needs to be calculated to construct GF((22 )3 ) from GF(22 ), Tr(x 2 ) = Tr(x) = 0
if it is known that a ∈ GF(22 ). Compared with computing all and Tr(1) = 1. Accordingly, the trace blocks in Fig. 5 are
three coefficients of a, the complexity is around 1/3. totally eliminated, and the multipliers, except those in the
first column, only need to compute the a_0 coefficients in the products. Although it has not been mathematically proved that there exist irreducible polynomials for every finite field, such that only Tr(1) is nonzero, we have successfully found such irreducible polynomials for GF((2^2)^4) and GF((2^1)^r) with various r. Another advantage of the architecture in Fig. 5 is that it does not require the nonleading symbols as inputs, and hence, they do not need to be computed from the transformed generator matrix multiplication.
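
The a_0-only product and the trace values stated for f_1(x) = x^3 + φ can be checked numerically. The following Python sketch is illustrative only. The 2-bit integer encoding of GF(2^2) elements (φ = 0b10, with φ^2 = φ + 1 from f_0(x) = x^2 + x + 1), the (a_0, a_1, a_2) tuple layout, and all function names are assumptions made for this example and are not part of the proposed architectures.

```python
# Illustrative sketch (assumed encoding, hypothetical names): arithmetic in
# GF((2^2)^3) built with f0(x) = x^2 + x + 1 and f1(x) = x^3 + phi.

PHI = 0b10  # phi = "10" in GF(2^2); bit 1 is the coefficient of phi


def gf4_mul(a: int, b: int) -> int:
    """Multiply two GF(2^2) elements given as 2-bit integers."""
    r = 0
    if b & 1:
        r ^= a
    if b & 2:
        r ^= a << 1
    if r & 0b100:  # reduce using phi^2 = phi + 1
        r ^= 0b111
    return r


def gf64_mul(b, c):
    """Full product a = b*c mod f1(x); elements are (a0, a1, a2) tuples."""
    b0, b1, b2 = b
    c0, c1, c2 = c
    m = gf4_mul
    a2 = m(b2, c0) ^ m(b1, c1) ^ m(b0, c2)
    a1 = m(PHI, m(b2, c2)) ^ m(b1, c0) ^ m(b0, c1)
    a0 = m(PHI, m(b2, c1) ^ m(b1, c2)) ^ m(b0, c0)
    return (a0, a1, a2)


def gf64_mul_a0(b, c):
    """Compute only the a0 coefficient, as a simplified multiplier would."""
    b0, b1, b2 = b
    c0, c1, c2 = c
    return gf4_mul(PHI, gf4_mul(b2, c1) ^ gf4_mul(b1, c2)) ^ gf4_mul(b0, c0)


def gf64_pow(a, n: int):
    """Raise a to the n-th power by repeated multiplication."""
    result = (1, 0, 0)
    for _ in range(n):
        result = gf64_mul(result, a)
    return result


def trace(a):
    """Tr(a) = a + a^4 + a^16, the trace from GF((2^2)^3) down to GF(2^2)."""
    a4 = gf64_pow(a, 4)
    a16 = gf64_pow(a4, 4)
    return tuple(x ^ y ^ z for x, y, z in zip(a, a4, a16))


ALL_ELEMENTS = [(a0, a1, a2)
                for a0 in range(4) for a1 in range(4) for a2 in range(4)]

if __name__ == "__main__":
    print("Tr(1)   =", trace((1, 0, 0)))
    print("Tr(x)   =", trace((0, 1, 0)))
    print("Tr(x^2) =", trace((0, 0, 1)))
    # With only Tr(1) nonzero, the trace of any element is its a0 coefficient.
    print(all(trace(a) == (a[0], 0, 0) for a in ALL_ELEMENTS))
```

Under this encoding, the trace of every element reduces to its a_0 coefficient, which is consistent with the trace blocks being dropped when f_1(x) = x^3 + φ is used. The a_0-only routine uses four GF(2^2) multiplications where the full product uses eleven in this straightforward form.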
Taking w(x) = w_0 + w_1 x + · · · + w_{e−1} x^{e−1} as input, the Fourier transform computes w(α^{−j}) for j = 0, 1, . . . , e − 1, and can also be implemented by a Chien search architecture. It is carried out on the message symbols in encoder B. When encoder A is employed, it is applied to the LDPC decoder outputs to get m̄. In both cases, the inputs of the Fourier transform are GF(2^p) symbols, and its outputs are GF((2^p)^t) elements. Moreover, only the leading symbols of the outputs are needed in successive steps. As shown in (8), the leading symbols have irregular indices. This makes it very difficult to skip the computations of those nonleading symbols in the Fourier transform using an outputwise partial-parallel architecture, such as those in Figs. 4 and 5. Instead, output-parallel and inputwise partial-parallel designs should be exploited to take advantage of the conjugacy constraints to reduce the hardware complexity. In an l-parallel architecture for the Fourier transform, the inputs are multiplied by powers of α^l. Although the inputs are GF(2^p) symbols, the products are GF((2^p)^t) symbols and are used as the inputs to the other constant multipliers. Hence, the other constant multipliers are pt-bit multipliers. Therefore, subfield elements may not be exploited in the same way as in the inverse Fourier transform to reduce the multiplier complexity. One solution is to set the parallel processing factor l to e/(2^p − 1) or its multiple. In this case, α^l and its powers are elements of GF(2^p). Dividing the Fourier transform inputs into groups of l symbols

$$\begin{aligned}
w(\alpha^{-j}) ={}& w_0 + w_1\alpha^{-j} + \cdots + w_{l-1}\alpha^{-j(l-1)} \\
&+ w_l\alpha^{-jl} + w_{l+1}\alpha^{-j(l+1)} + \cdots + w_{2l-1}\alpha^{-j(2l-1)} \\
&+ w_{2l}\alpha^{-j(2l)} + w_{2l+1}\alpha^{-j(2l+1)} + \cdots + w_{3l-1}\alpha^{-j(3l-1)} + \cdots \\
={}& w_0 + w_1\alpha^{-j} + \cdots + w_{l-1}\alpha^{-j(l-1)} \\
&+ w_l\alpha^{-jl} + (w_{l+1}\alpha^{-jl})\alpha^{-j} + \cdots + (w_{2l-1}\alpha^{-jl})\alpha^{-j(l-1)} \\
&+ w_{2l}\alpha^{-2jl} + (w_{2l+1}\alpha^{-2jl})\alpha^{-j} + \cdots + (w_{3l-1}\alpha^{-2jl})\alpha^{-j(l-1)} \\
&+ \cdots
\end{aligned} \tag{9}$$

Hence, each group of symbols can be multiplied by the same 1, α^{−j}, α^{−2j}, . . . , α^{−j(l−1)} in the computation of w(α^{−j}) if they are premultiplied with α^{−jl}, α^{−2jl}, . . . in subsequent clock cycles. By choosing l = e/(2^p − 1), only the least significant coefficients in the polynomial representations of α^{−jl}, α^{−2jl}, . . . are nonzero. As a result, the subsequent multiplications with 1, α^{−j}, α^{−2j}, . . . , α^{−j(l−1)} are done by simplified constant multipliers, whose complexities are also around 1/t the complexities of pt-bit constant multipliers.

The proposed Fourier transform architecture is shown in Fig. 6. To clarify the indices, this architecture is also for the example case of GF((2^2)^3). Each premult block multiplies the input with α^{−l}, α^{−2l}, . . . , α^{−(2^p−2)l}. When p = 2, there are only two constant multipliers over GF(2^2) in each premult block, and each of them is implemented by one XOR gate. Then, the products are switched and sent to the array of constant multipliers according to (9). If using l = e/(2^p − 1) or its multiple is not sufficient to catch up with the throughput of the other units, then fully parallel design is adopted. The constant multipliers in a fully parallel architecture also multiply GF(2^p) symbols with GF((2^p)^t) constants. However, if other parallel processing factors are used, each constant multiplier needs to compute the product of a GF((2^p)^t) symbol and a GF((2^p)^t) constant, and the design becomes cost-ineffective.

Fig. 6. Simplified Fourier transform architecture.

Fig. 7. (a) Architecture for modified transformed encoder A. (b) Decoder architecture when encoder A is used.

E. Modified Transformed Encoder Architecture A
Fig. 7(a) shows the overall architecture of the modified transformed encoder A. Assume that l < e symbols of m are input at a time. They are stored into a shift register inside the message mapping block that is able to hold e symbols over GF(2^p). After e symbols of m are collected, the leading symbols of the corresponding m̄ are derived according to the cyclotomic cosets by the architecture in Fig. 2. The size of the shift register holding the leading symbols of m̄ is e′r-bit, where e′ is the total number of cyclotomic cosets. Then, l′ leading symbols of m̄ are shifted out at a time and multiplied with the entries in the parity part of the transformed generator matrix G_F using the architecture in Fig. 3. Only the leading symbols in each diagonal submatrix of G_F are stored and multiplied. To avoid data buildup, l′ should be chosen so that ⌈e/l⌉ = ⌈e′/l′⌉. The outputs of the matrix multiplication block are the parity symbols. While m̄ are sent to this block, they are also routed through to become the systematic symbols. Both systematic and parity symbols need
to be inverse Fourier transformed to derive the codeword. When the inverse Fourier transform architecture in Fig. 4 is employed, those nonleading symbols are needed, and they are recovered in the message expanding block. This block computes the 2^p-th powers of the leading symbols, and can be implemented easily. When t is not small and/or some of the traces of the basis elements are zero, the inverse Fourier transform architecture in Fig. 5 has lower complexity than that in Fig. 4. The message expanding block is not required when the architecture in Fig. 5 is used.

When encoder A is adopted, extra steps are needed after the LDPC decoding, as shown in Fig. 7(b). According to (6), Fourier transform is applied to the decoding result to derive m̄G_F. Systematic symbols of m̄ are directly observed from m̄G_F. Then, m is recovered by inverse mapping.

F. Modified Transformed Encoder Architecture B
The first ke columns of G_F are also an identity matrix. From (7), the first k blocks of e symbols in the codeword generated from encoder B are

$$m_F\, I\, V^{-1(k)} = m\, V^{(k)}\, I\, V^{-1(k)} = m.$$

As a result, the Fourier transform and inverse for computing the systematic part of the codeword are canceled out in encoder B, and the message symbols directly become the systematic part. Accordingly, encoder B is implemented by the architecture shown in Fig. 8. The inverse Fourier transform is only applied to the parity symbols. Moreover, no extra step is needed after the LDPC decoding when encoder B is employed.

Fig. 8. Architecture for modified transformed encoder B.

Encoder B is implemented by the same transformed generator matrix multiplication, Fourier transform, and inverse architectures as in encoder A. It does not need the message mapping, which is implemented as hard-wiring. Hence, the area requirements of the two proposed encoding schemes are very similar. Nevertheless, in encoder B, the message symbols do not go through any transform before they become the systematic part of the codeword. Therefore, encoder B needs to activate the inverse Fourier transform for much fewer blocks of symbols, as summarized in Table I. Accordingly, its overall power consumption would be lower than that of encoding approach A, especially for high-rate codes. However, if lower complexity and power consumption are preferred in the encoder itself, such as in the case that the encoder is located in a power-constrained device and the decoder is not, approach A is a better choice, since the Fourier transform is located in the decoder, as shown in Fig. 7.

TABLE I
NUMBERS OF FOURIER TRANSFORM AND INVERSE NEEDED BY THE MODIFIED TRANSFORMED ENCODING/DECODING SYSTEMS

V. COMPLEXITY ANALYSIS AND COMPARISONS

This section analyzes the complexities of the proposed encoders for two example codes. One is a (2016, 1764) QCNB-LDPC code over GF(2^2) whose G matrix consists of 28 × 32 blocks of 63 × 63 circulants. The other is a (8128, 6985) binary QC-LDPC code whose G matrix consists of 55 × 64 circulants of dimension 127 × 127. For both codes, the proposed encoder architectures lead to substantial complexity reduction.

For the (2016, 1764) NB code, the Fourier transform is defined over GF(2^6), and composite field GF((2^2)^3) is adopted. f_0(x) = x^2 + x + 1 is used to construct GF(2^2) from GF(2), and GF((2^2)^3) is constructed from GF(2^2) by using f_1(x) = x^3 + φ, where φ = 10 ∈ GF(2^2). If x is a root of f_1(x), then Tr(x^2) = Tr(x) = 0 and Tr(1) = 1. As a result, the inverse Fourier transform architecture in Fig. 5 has lower complexity. The trace blocks in this architecture are eliminated, and each of the constant multipliers, except those in the first column, computes only the least significant coefficient in the product. Additions over finite fields using basis representations are bitwise XOR operations. Assume that a two-input NAND gate takes 1/2 the area of an XOR. A GF(2^2) multiplier takes five XOR gates to implement and has three logic gates in the critical path. A GF((2^2)^3) multiplier requires around 54 XOR gates, and has six logic gates in the critical path. Constant multipliers are much simpler than general multipliers, since one of the operands is fixed. For example, a constant multiplier by “10” or “11” over GF(2^2) requires one XOR gate, and multiplying an element in GF(2^2) by α^3 ∈ GF((2^2)^3) as needed in the inverse Fourier transform architecture is implementable by three XOR gates.

TABLE II
HARDWARE COMPLEXITIES OF 8-PARALLEL MODIFIED TRANSFORMED ENCODERS FROM ARCHITECTURAL ANALYSIS FOR (2016, 1764) QCNB-LDPC CODE OVER GF(2^2) WITH r = 6, p = 2, AND n − k = 4

Assume that l = 8 input message symbols are available at a time. To match the throughput of the input symbols, 8-parallel processing is adopted in the inverse Fourier transform, and the parallel processing factor for the generator matrix multiplication is set to 3. Since G_F has 32 − 28 = 4 parity block columns, 3 × 4 = 12 general multipliers over GF((2^2)^3)
are needed in the matrix multiplication block. e/(2^p − 1) = 21-parallel processing is employed in the Fourier transform architecture. Although it runs at a speed that is higher than the other components in the encoder, 8-parallel Fourier transform design is more costly as explained in Section IV. Table II shows the logic gate and memory requirements of the building blocks in the proposed encoders from architectural analysis. Encoder A consists of the first three components in Table II. However, for the purpose of fair comparison, the complexity of the Fourier transform needed in the decoder when encoder A is used should also be included. Although encoder B does not require message mapping, it needs registers to buffer a block of e message symbols before they are sent to the Fourier transform. Hence, the total area requirements of the two encoding schemes are very similar. Pipelining is applied before and after the generator matrix multiplication in both proposed encoders. It only brings 2 × ⌈63/8⌉ = 16 clock cycles of extra initial latency. The encoding of a message vector still takes 32 × ⌈63/8⌉ = 256 clock cycles.

The proposed encoders are compared with an l = 8-parallel traditional encoder that computes c = mG using the architecture from [5] in Table III. Since the original generator matrix G is dense in the parity columns and the nonzero entries appear at irregular locations, (n − k)l multipliers over GF(2^2) are employed to take care of the multiplications of all parity columns in l rows simultaneously. Similarly, the l products computed for each column are accumulated to intermediate results stored in registers. Assuming that each register takes three times the area of an XOR gate and storing a bit in memory needs about the same area as an XOR, the proposed encoders reduce the area requirement by 19%. Since the critical path and the number of clock cycles needed for encoding each message vector are the same in the original and proposed encoders, their achievable throughput is the same.

TABLE III
COMPARISON OF 8-PARALLEL ENCODERS FOR (2016, 1764) QCNB-LDPC CODE OVER GF(2^2) FROM ARCHITECTURAL ANALYSIS

To further validate the advantages of our design, both the proposed and traditional encoders are synthesized using TSMC 28-nm process under timing constraint of 2.5 ns. The results are shown in Table IV. Both encoders achieve 6.3-Gb/s throughput. The equivalent gate counts are derived by dividing the total areas from the synthesis reports by that of an XOR gate. The ROMs storing the matrix entries are actually implemented by combinational logic. The synthesis tool is able to optimize and combine a lot of cells. As a result, the gate counts in Table IV are much smaller than those in Table III. Nevertheless, architectural analysis is a good tool to compare the relative complexities of different designs, and the synthesis results support similar percentage of area saving achievable by our proposed designs as analyzed from architectures.

TABLE IV
SYNTHESIS RESULTS OF EIGHT-PARALLEL ENCODERS FOR (2016, 1764) QCNB-LDPC CODE OVER GF(2^2) USING TSMC 28-nm PROCESS

The encoding algorithms making use of finite Fourier transform in [6] and [7] are given without any hardware architecture. Implementing the Fourier transform and inverse using standard architectures leads to large area and the message mapping may have significant overhead if the bases are not chosen properly. As it was mentioned, the two encoding algorithms in [7] are the binary cases of the proposed algorithms. In the design of [6], buffering the input messages for interblock mapping alone needs 5796 registers for the example code, and the matrix permutations make the storage and access of intermediate results very difficult. Hence, the comparison with this encoder is not further pursued.

When l increases, the complexity of the over-qualified Fourier transform engine becomes relatively smaller in the overall modified transformed encoders. Hence, the proposed designs would achieve more significant saving over the original encoder for larger l. The number of conjugacy classes in GF((2^p)^t) with respect to GF(2^p) is around 2^{pt}/t. As t increases, the G_F multiplication needs even fewer multipliers compared with the multiplication by G. The relative complexities of GF((2^p)^t) multipliers needed in the G_F multiplication and GF(2^p) multipliers required in multiplying G depend on many factors, such as the irreducible polynomials used for field constructions, and whether t is further decomposable. Nevertheless, even if p = 1 and a GF(2) multiplier is just a single AND gate in the original encoder, the proposed modified transformed encoders can still achieve substantial gate count reduction when r = pt is not small.

TABLE V
HARDWARE COMPLEXITIES OF 16-PARALLEL MODIFIED TRANSFORMED ENCODERS FROM ARCHITECTURAL ANALYSIS FOR (8128, 6985) BINARY LDPC CODE WITH r = 7 AND n − k = 9

Now, consider a (8128, 6985) binary QC-LDPC code whose generator matrix has nine block parity columns of 127 × 127 circulants. Assume that l = 16 bits are input to the encoder in each clock cycle. The 16-parallel processing is also adopted in the inverse Fourier transform. Since there are 19 conjugacy classes in GF(2^7) with respect to GF(2), only l′ = 3 multipliers over GF(2^7) are needed for each block of parity columns in the transformed generator matrix multiplication. A GF(2^7) multiplier needs around 85 XOR gates to implement. Since
e/(2^p − 1) = e for binary codes, the partial-parallel Fourier transform architecture in Fig. 6 becomes a fully parallel design. The complexities of the proposed encoders for this binary LDPC code are summarized in Table V, and compared with that of the conventional encoder in Table VI. Although GF(2) multipliers are two-input AND gates, the conventional encoder requires a large number of them. Hence, the multiplication with the original generator matrix has much higher complexity. Despite the overheads of the buffers, Fourier transform, and inverse, the proposed encoders require 84% of the area to implement. The same number of clock cycles is needed to generate a codeword. A GF(2^7) multiplier has eight XOR gates in the data path. Although the proposed encoders have longer critical paths than the conventional binary encoder, practical systems prefer using a single clock signal, and usually, there are other components in the system with a longer critical path. In this case, replacing the conventional encoder by the proposed encoders for binary codes may not bring any throughput penalty for the overall system.

TABLE VI
COMPARISON OF 16-PARALLEL ENCODERS FOR (8128, 6985) BINARY LDPC CODE FROM ARCHITECTURAL ANALYSIS

The binary encoder example described earlier demonstrated that our proposed designs also achieve substantial complexity reduction for codes of longer lengths and/or lower rates, which translate to larger (n − k)e. The achievable complexity reduction would be more significant for codes with larger n − k. This is because the complexity of the entire logic part in the traditional encoder is proportional to n − k. On the other hand, the complexities of the Fourier transform and inverse, which account for the majority of the logic part in the transformed decoders, do not change with n − k.

VI. CONCLUSION

This paper presented two modified transformed encoders for QCNB-LDPC codes over subfields. The proposed designs can also be applied to binary codes, which are a special case of codes over subfields. Compared with the previous Fourier-transformed encoder, the proposed designs have much simpler data flow and avoid large buffers. In addition, novel architectures are developed to implement the Fourier transform and inverse by exploiting composite field arithmetic and conjugacy constraints. With these optimizations, the proposed designs have substantially smaller area requirement than the traditional encoder. The memory used for storing the generator matrix occupies a large part of the encoder. Future work will be directed to reducing this storage requirement.

ACKNOWLEDGMENT

The authors would like to thank Xiaoheng Chen for providing the synthesis results.

REFERENCES

[1] X. Chen and C.-L. Wang, “High-throughput efficient non-binary LDPC decoder based on the simplified min-sum algorithm,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 59, no. 11, pp. 2784–2794, Nov. 2012.
[2] F. Cai and X. Zhang, “Relaxed min-max decoder architectures for nonbinary low-density parity-check codes,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 21, no. 11, pp. 2010–2023, Nov. 2013.
[3] X. Zhang, “Low-complexity min-max non-binary LDPC decoders,” J. Commun., vol. 10, no. 11, pp. 836–842, Nov. 2015.
[4] B. Zhou, J. Kang, S. Song, S. Lin, K. Abdel-Ghaffar, and M. Xu, “Construction of non-binary quasi-cyclic LDPC codes by arrays and array dispersions,” IEEE Trans. Commun., vol. 57, no. 6, pp. 1652–1662, Jun. 2009.
[5] Z. Li, L. Chen, L. Zeng, S. Lin, and W. H. Fong, “Efficient encoding of quasi-cyclic low-density parity-check codes,” IEEE Trans. Commun., vol. 54, no. 1, pp. 71–81, Jan. 2006.
[6] Q. Huang, L. Tang, Z. Wang, Z. Xiong, and S. He. (Jan. 2013). “A low-complexity encoding of quasi-cyclic codes based on Galois Fourier transform.” [Online]. Available: https://arxiv.org/abs/1301.3220
[7] Q. Huang, L. Tang, S. He, Z. Xiong, and Z. Wang, “Low-complexity encoding of quasi-cyclic codes based on Galois Fourier transform,” IEEE Trans. Commun., vol. 62, no. 6, pp. 1757–1767, Jun. 2014.
[8] X. Zhang and Y. Tai, “Encoder with transform architecture for LDPC codes over subfields using message mapping,” U.S. Patent 2015 0 381 204, Dec. 31, 2015.
[9] X. Zhang and Y. Tai, “Encoder for quasi-cyclic low-density parity-check codes over subfields using Fourier transform,” U.S. Patent 2015 0 381 205, Aug. 30, 2016.
[10] X. Zhang and Y. Tai, “Low complexity partial parallel architectures for Fourier transform and inverse Fourier transform over subfields of a finite field,” U.S. Patent 2015 0 301 985, Apr. 22, 2014.
[11] J. L. Massey and S. Serconek, “A Fourier transform approach to the linear complexity of nonlinearly filtered sequences,” in Advances in Cryptology (Lecture Notes in Computer Science), vol. 839. Santa Barbara, CA, USA: Springer-Verlag, 1994, pp. 332–340.
[12] K. Lee, H.-G. Kang, J.-I. Park, and H. Lee, “A high-speed low-complexity concatenated BCH decoder architecture for 100 Gb/s optical communications,” J. Signal Process. Syst., vol. 66, no. 1, pp. 43–55, Jan. 2012.

Xinmiao Zhang (S’04–M’05–SM’10) received the Ph.D. degree from the University of Minnesota, Minneapolis, MN, USA, in 2005.
She joined SanDisk (Western Digital) Corporation, Milpitas, CA, USA, in 2013. She has been a tenured Associate Professor with Case Western Reserve University, Cleveland, OH, USA. She has authored the book VLSI Architectures for Modern Error-Correcting Codes (CRC Press, 2015), and co-edited the book Wireless Security and Cryptography: Specifications and Implementations (CRC Press, 2007). Her current research interests include VLSI architecture design for error-correcting coding, signal processing, and cryptography.
Dr. Zhang is a member of the CASCOM and VSA technical committees, and DISPS Technical Committee Advisory Board of the IEEE. She received the National Science Foundation CAREER Award in 2009 and the Best Paper Award at the ACM Great Lakes Symposium on VLSI in 2004. She served on the technical program and reviewer committees of many conferences. She has been an Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I since 2010.

Ying Tai (S’05–M’06–SM’12) received the Ph.D. degree from the University of California at Davis, Davis, CA, USA, in 2006.
He held the senior technical and management positions with SanDisk Corporation, Milpitas, CA, USA, Cadence Design Systems Inc., San Jose, CA, USA, Boeing, Chicago, IL, USA, and Ikanos Communications Inc., Fremont, CA, USA. His current research interests include coding theory, signal processing, and their applications to data communications and storage systems.
Dr. Tai was a recipient of the IEEE Communications Society Stephen O. Rice Prize in the field of communications theory in 2007.