A Novel Data Packing Techniques For QC-LDPC Decoder Architecture Applied To NAND Flash Controller

8th Global Conference on Consumer Electronics (GCCE)
A Novel Data Packing Technique for QC-LDPC Decoder Architecture Applied to NAND Flash Controller
Longyu Ma, Hong-fu Chou, Chiu-Wing Sham
Department of Computer Science, The University of Auckland, Auckland, New Zealand
Abstract—This paper presents a data packing technique for Quasi-Cyclic LDPC (QC-LDPC) decoders
applied to NAND flash controllers, where the long code lengths required lead to high complexity
and routing congestion during floor planning. First, we introduce a shift network that reorders
the channel data sequence. This leads to a generic data packing architecture for the LDPC decoder.
Second, the proposed LDPC architecture overcomes the complicated routing problem caused by the
random access pattern of message passing within the LDPC decoder. Based on the proposed
architecture, the hardware description has been synthesized with Design Compiler using a TSMC
0.18 µm model, and FPGA-based floor planning is provided with the implementation result. Synthesis
results also show that the area and throughput of the proposed decoder are suitable for NAND flash
controller applications.
Index Terms—NAND flash, LDPC code, Solid State Disc

I. INTRODUCTION
The market demand for non-volatile NAND flash memory has been increasing for a long time, driven
by the rapid growth of new applications that do not depend on heavy hard disks. Flash memory [1]
serves as the main non-volatile storage device, and the flash interface unit is used in
system-on-chip (SoC) products. Flash memory also provides a low-power solution for storage
systems, and its small size and light form factor are essential properties for this type of
storage. Basic flash commands are provided that the main central processing unit (CPU) can use to
access data from the flash memory. The flash memory is also used to initiate the boot process from
the firmware and plays an important role in allowing the storage device to execute the tasks to be
performed by the main CPU. The flash interface unit has mainly served as a reliable component for
graphics and multimedia processors and has been applied to digital televisions, car navigation
systems, and mobile applications. To support multimedia applications, flash controller units have
been optimized for large block reads and writes, as presented in [2], which minimizes main CPU
interaction and supports direct memory access (DMA) [3] when transferring data from the flash
memory to the system DRAM. In SoC applications, all of the boot information is generally stored in
the flash memory. The flash memory includes a number of partitions for the boot loader code, and
the flash file system is created in the flash memory. As discussed in [4], the DMA collaborates
with the error control coding (ECC) block. The ECC engine is critical to system performance, and
the ECC decoder dominates area and cost, comprising a high percentage of the flash controller.
Quasi-Cyclic (QC) LDPC codes [5] reduce the hardware implementation cost and achieve desirable
decoding performance compared with computer-generated random codes, since a QC-LDPC code retains
the random property within its base matrix while using cyclically shifted ordering to construct
the code. As a more efficient approach, packing multiple messages into a single memory word takes
advantage of the configurable data width and depth of embedded memory in an FPGA and can maximize
the decoding throughput. The authors of [6] proposed improvements to LDPC decoder implementation
and identified key challenges: each memory access provides multiple messages, so the functional
units must be duplicated to process the messages concurrently. Moreover, data alignment hardware
is required to route the messages to the appropriate functional units. However, this is difficult
to manage in a randomized LDPC message passing network. A generic message packing methodology is
needed for effective data processing. Based on the configuration of block RAM, we follow the
thread in [6] and take a different view of the randomized property, reordering the input message
sequence of the decoder according to the word length of the block RAM. This approach alleviates
the disorder of the randomized shifting property during message passing decoding, which reduces
routing overhead and redundant processing time for check node computation.
In this paper, we propose a generic data packing technique that is well suited to our proposed
architecture and takes advantage of the configurable data width and depth of embedded
memory to provide a flexible partially parallel decoder that addresses the complexity of routing.
In Section II, we present the proposed data packing technique. In Section III, the proposed LDPC
architecture is presented. Section IV concludes the paper.

II. THE PROPOSED DATA PACKING TECHNIQUE
The $M \times N$ parity-check matrix $H$ of a QC-LDPC code is an $m \times n$ array of
$p \times p$ submatrices, where each submatrix is either a $p \times p$ zero matrix or a
cyclically shifted identity matrix. The parity-check matrix is divided into $n$ block columns and
$m$ block rows as follows.

$$
H = \begin{bmatrix}
A_{0,0} & A_{0,1} & \cdots & A_{0,n-1} \\
A_{1,0} & A_{1,1} & \cdots & A_{1,n-1} \\
\vdots & \vdots & \ddots & \vdots \\
A_{m-1,0} & A_{m-1,1} & \cdots & A_{m-1,n-1}
\end{bmatrix} \tag{1}
$$

A binary $N$-tuple $v = (v_0, v_1, \ldots, v_{N-1})$ is a codeword of the QC-LDPC code if and only
if $vH^T = 0$, which in expanded form reads

$$
v_0 h_{i,0} + v_1 h_{i,1} + \cdots + v_{N-1} h_{i,N-1} = 0 \tag{2}
$$

For the $i$th row of $H$, the operation $vH^T = 0$ provides a modulo-2 check-sum. This check-sum
is the essential computation of a check node. As the rows of the parity-check matrix $H$ show, the
$i$th check node must process the message from the $j$th variable node whenever $h_{i,j} = 1$. For
check node processing, all of the variable nodes $j$ connected to the $i$th check node should be
accessed through a given memory address. However, the partially parallel architecture is limited
by the word-length configuration of the block RAM. Without careful design of the data alignment,
the randomized shift property of QC-LDPC codes may lead to poorly organized processing. Therefore,
the natural order of the messages must be modified to fit the randomized QC-LDPC shift property.
Our approach shows that a circular shift operation can effectively handle the randomized message
passing. After reordering according to the depth of the RAM, each bundle of messages stored at an
individual memory address can easily be matched to the same check node. As a result, a wider word
configuration of the block RAM provides better QC-LDPC decoder throughput under a more highly
parallel architecture.
When the received messages are stored into the LLR RAM, the message sequence must be reordered for
the data packing technique. The reordering depends on the depth of the RAM, where each address
stores a bundle of messages. For example, if the depth of the RAM is $l$, each address in the RAM
holds $p/l$ messages in a bundle. Note that $l$ must be a factor of $p$. The block interleaver [7]
used for the reordering is denoted $\pi(i)$, where $0 \le i < N$:

$$
\pi(i) = \frac{p}{l}\,(i \bmod l) + \left\lfloor \frac{i}{l} \right\rfloor \tag{3}
$$

where $\lfloor \cdot \rfloor$ denotes the largest integer less than or equal to its argument. In
Fig. 1, we illustrate a simple case of $p = 16$ and $l = 4$, in which the messages are packed into
Addresses 0 to 3. Fig. 1 (A) shows the natural order of packing the messages into the memory.
After interleaving the message sequence, the messages can be packed into memory as shown in
Fig. 1 (B). The proposed data packing technique relies on the reordered data sequence to pack the
messages into the same BRAM address and pop them out during message passing, which ensures that
the messages are all sorted to the same parity-check node. As shown in Fig. 2, packing the
messages in natural order results in the messages at address Addr0 not matching the same check
node. This leads to redundant time spent accessing address Addr1 to get message 12. As the design
grows larger, this kind of disorder will degrade the performance of the LDPC decoder. As a result,
the proposed approach alleviates the redundant waiting time and the buffering of messages in
registers, which allows the check node processing to be computed immediately in the partially
parallel architecture. As the example in Fig. 3 shows, the check node process can operate directly
on the bundle of messages after circular shifting. Since the message sequence is reordered
according to the word length of the BRAM, the randomized cyclic shift property does not affect the
proposed architecture, which leads to efficient computation and less routing overhead.

III. THE PROPOSED LDPC DECODER ARCHITECTURE AND THE SYNTHESIS RESULT
The proposed LDPC decoder is presented in Fig. 4, taking as an example a (40,4) QC-LDPC code of
length 10240. The QC-LDPC code is randomly generated from a computer seed and targeted at a low
error floor. The parity length is 1024 and the code rate is 0.9. We use layered min-sum decoding
with 4 layers, and the maximum number of iterations is set to 8. We propose 5 groups, each
combining variable node processes (VNP) and check node processes (CNP). Each group contains 32
VNPs and 16 CNPs for computing the message passing operations. After that, messages are stored
into RAM and popped out to the VNC accordingly. In the generic proposed architecture, the CNP only
needs a simple circular shifter to match a bundle of message data to the corresponding
parity-check ensemble. After the CNP computation, the messages are passed to CMPtop, which selects
the minimum value. The controller provides the control signals that sequence the execution of the
states involved. Fig. 5 presents a computing schedule of the proposed architecture. As this is a
partially parallel architecture, decoding one block RAM, which stores the messages of 8 $A_{i,j}$
blocks, is divided into 4 sections for accessing the messages and computing the CNP and VNP. Owing
to the wide word length of the SRAM configuration, 2 message bundles are accessed from each
address, corresponding to 2 $A_{i,j}$ blocks. The CNP and VNP are processed sequentially, so we
can insert flip-flops to retime the shortest path along the combinational circuit. The synthesis
result for the proposed LDPC decoder architecture is 636.74K gates using TSMC 0.18 µm technology.
The latency of decoding a QC-LDPC codeword is 6144 clock cycles. With a clock frequency of
320 MHz, the decoder throughput can reach 633 Mbps.
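As a rough illustration of the min-sum check-node step that the CNP and CMPtop blocks perform, the following Python sketch computes the outgoing messages of a single check node from its incoming messages. It is not taken from the paper's hardware description: the function name `min_sum_check_node`, the use of plain integers for the quantized messages, and the two-minimum bookkeeping are assumptions made for illustration, and layering as well as any scaling or offset correction are left out.

```python
from typing import List


def min_sum_check_node(incoming: List[int]) -> List[int]:
    """Min-sum update for one check node (illustrative sketch).

    incoming[k] is the signed variable-to-check message on edge k.
    The message sent back on edge k carries the product of the signs of
    all *other* incoming messages and the minimum of their magnitudes.
    """
    if len(incoming) < 2:
        raise ValueError("a check node needs at least two edges")

    # Track the two smallest magnitudes and the position of the smallest.
    min1 = min2 = None
    min1_pos = -1
    sign_product = 1
    for k, msg in enumerate(incoming):
        mag = abs(msg)
        if msg < 0:
            sign_product = -sign_product
        if min1 is None or mag < min1:
            min2, min1, min1_pos = min1, mag, k
        elif min2 is None or mag < min2:
            min2 = mag

    outgoing = []
    for k, msg in enumerate(incoming):
        # The edge holding the overall minimum receives the second minimum.
        mag = min2 if k == min1_pos else min1
        # Multiplying by the edge's own sign removes it from the product.
        own_sign = -1 if msg < 0 else 1
        outgoing.append(sign_product * own_sign * mag)
    return outgoing
```

For the incoming messages `[3, -1, 4, -2]`, the sketch returns `[1, -2, 1, -1]`: every edge receives the smallest magnitude among the other three edges, with the sign given by the parity of their negative entries. Keeping only the two smallest magnitudes is a common way to avoid recomputing a minimum per edge.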
IV. CONCLUSION
The proposed QC-LDPC decoder alleviates the complexity of random-access message passing, which
affects floor planning and complicates the management of messages stored back into block RAM. We
use a simple block interleaver to reorder the message sequence, and a circular shifter resolves
the routing congestion during the place-and-route process. In future research, a more highly
parallel processing unit could increase the throughput without added routing difficulty. This
generic architecture provides insight into long QC-LDPC codes for storage system applications.
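The block interleaver of Eq. (3) that underlies this reordering can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the names `interleave_index`, `pack_block` and `pack_natural` are hypothetical, the index is assumed to apply within one block of p messages, and message i is assumed to move to position π(i) before the sequence is cut into l words of p/l messages each.

```python
from typing import List, Sequence


def interleave_index(i: int, p: int, l: int) -> int:
    """Eq. (3): pi(i) = (p/l) * (i mod l) + floor(i/l), for 0 <= i < p."""
    if l <= 0 or p % l != 0:
        raise ValueError("l must be a positive factor of p")
    if not 0 <= i < p:
        raise ValueError("i must lie inside one block of p messages")
    return (p // l) * (i % l) + i // l


def pack_natural(messages: Sequence[int], l: int) -> List[List[int]]:
    """Cut the messages, in their natural order, into l memory words."""
    p = len(messages)
    if l <= 0 or p % l != 0:
        raise ValueError("l must be a positive factor of the block size")
    width = p // l
    return [list(messages[a * width:(a + 1) * width]) for a in range(l)]


def pack_block(messages: Sequence[int], l: int) -> List[List[int]]:
    """Interleave one block with Eq. (3), then cut it into l memory words.

    Assumption: message i is written to position pi(i) of the reordered
    sequence, and address a holds positions a*(p/l) .. (a+1)*(p/l) - 1.
    """
    p = len(messages)
    reordered = [0] * p
    for i, msg in enumerate(messages):
        reordered[interleave_index(i, p, l)] = msg
    return pack_natural(reordered, l)
```

With p = 16 and l = 4, as in the example of Section II, `pack_block(list(range(16)), 4)` yields the words `[0, 4, 8, 12]`, `[1, 5, 9, 13]`, `[2, 6, 10, 14]` and `[3, 7, 11, 15]`, whereas `pack_natural` yields `[0, 1, 2, 3]`, `[4, 5, 6, 7]` and so on. Under the stated convention, each address therefore holds the messages whose indices are congruent modulo l.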
Fig. 1. Example of a packing memory.
Fig. 2. Example of a natural packing approach.
Fig. 3. Example of a proposed packing technique.
Fig. 4. A proposed LDPC decoder architecture.
Fig. 5. A computing schedule of the proposed architecture.

REFERENCES
[1] Yoshio Nishi, Advances in Non-volatile Memory and Storage Technology, Electronic and Optical
Materials. Woodhead Publishing, 2014.
[2] Yu Liu, Lixin Cheng, and Xuguang Wang, “Commands scheduling optimized flash controller for
high bandwidth SSD application,” IEEE 11th International Conference on Solid-State and Integrated
Circuit Technology (ICSICT), pp. 1-3, 2012.
[3] L. Rota, M. Caselle, S. Chilingaryan, A. Kopmann, and M. Weber, “A PCIe DMA Architecture for
Multi-Gigabyte Per Second Data Transmission,” IEEE Transactions on Nuclear Science, vol. 62,
pp. 972-976, June 2015.
[4] Daesung Kim and Jeongseok Ha, “Serial quasi-primitive BC-BCH codes for NAND flash memories,”
2016 IEEE International Conference on Communications (ICC), pp. 1-6.
[5] L. Lan, L. Zeng, Y. Tai, L. Chen, S. Lin, and K. Abdel-Ghaffar, “Construction of quasi-cyclic
LDPC codes for AWGN and binary erasure channels: A finite field approach,” IEEE Transactions on
Information Theory, vol. 53, no. 7, pp. 2429-2458, Jul. 2007.
[6] X. H. Chen, J. Y. Kang, S. Lin, and V. Akella, “Memory System Optimization for FPGA-Based
Implementation of Quasi-Cyclic LDPC Codes Decoder,” IEEE Transactions on Circuits and Systems,
vol. 58, no. 1, pp. 98-111, Jan. 2011.
[7] A. D. Houghton, The Engineer’s Error Coding Handbook. Boston, MA: Springer, 1997.
