LDPC CCSDS SATELLITE TRANSPONDER

by

Dana R. Sorensen

MASTER OF SCIENCE
in
Electrical Engineering

2020
ABSTRACT

LDPC (Low Density Parity Check) codes offer error-correcting capability that approaches
the Shannon limit. A set of short block-length LDPC codes has recently been standardized
by the CCSDS (Consultative Committee for Space Data Systems) for use with satellite
telecommands. This report describes and implements state-of-the-art decoding algorithms
and architectures with varying degrees of serialization. These implementations are compared
to determine the best decoder architecture in the context of the CCSDS code on the Iris
small satellite radio. A signal and noise estimation algorithm is developed and implemented
for use with the decoders, and an LDPC encoder with a novel flexible architecture is also
implemented.

(64 pages)
PUBLIC ABSTRACT

Dana R. Sorensen
Error-correcting codes add redundant information to transmitted messages, allowing
corruptions to be detected and corrected when the message is received. This is especially
important for deep space satellite communications, since the long distances and low
signal power levels often cause message corruption. A very strong type of error-correction
coding known as LDPC codes was recently standardized for use with space communications.
This project implements the encoding and decoding algorithms required for a small satellite
radio to be able to use these LDPC codes. Several decoder architectures are implemented
and compared by their performance, speed, and complexity. Using these LDPC decoders
requires knowledge of the received signal and noise levels, so an appropriate algorithm for
estimating these parameters is developed and implemented. The LDPC encoder is implemented
using a flexible architecture designed to support the entire standardized family of ten LDPC
codes.
CONTENTS

ABSTRACT
PUBLIC ABSTRACT
LIST OF TABLES
LIST OF FIGURES
NOTATION
ACRONYMS
1 INTRODUCTION
   1.1 Project Objectives and Criteria
   1.2 Project Overview
   1.3 History of LDPC Codes
   1.4 Overview of LDPC Codes
       1.4.1 Structure of LDPC Codes
       1.4.2 Performance of LDPC Codes
       1.4.3 Challenges
   1.5 Overview of CCSDS LDPC Codes
       1.5.1 Telecommand Codes
       1.5.2 Telemetry Codes
   1.6 Iris Radio
       1.6.1 History
       1.6.2 LDPC Addition
2 ENCODER DESIGN
   2.1 Overview of Encoding Algorithms and Architectures
   2.2 Modifications to Traditional RCE
       2.2.1 Serialization
       2.2.2 Flexible Recursion Length
   2.3 Memory Organization
       2.3.1 Generator Matrix ROM
       2.3.2 Working Memory
   2.4 Code Support
3 DECODER DESIGN
   3.1 Overview of Decoding Algorithms
   3.2 Check Node Update
   3.3 Variable Node Update
   3.4 Architectures
LIST OF TABLES

LIST OF FIGURES

1.2 Word Error Rates for Several CCSDS Uplink Coding Schemes (from [1])
1.3 Protograph for Short Block Length LDPC Codes (from [1])
1.4 Parity Check Matrix for the (128,64) LDPC Code (from [1])
1.7 Power Efficiency versus Spectral Efficiency for Several CCSDS Codes (from [2])
3.1 Tanner Graph Representation of an Example Parity Check Matrix (from [5])
4.2 Comparison of BERs for LDPC Decoder Architectures (TODO: 4.0 wordsize)
NOTATION
Symbols
R code rate
G generator matrix
c codeword vector
yi received symbol
α signal amplitude
ACRONYMS
CHAPTER 1
INTRODUCTION
This project implements the CCSDS telecommand and telemetry LDPC codes on the Iris
Radio platform. Algorithms and their variants are compared by implementation loss, speed,
and complexity. Several architectures are implemented, and the best decoder architecture
in the context of the CCSDS codes is selected.
Thousands of papers about LDPC codes have been published in the last few decades,
exploring code design, analysis, and decoding algorithms. As a result, many codes and
variations of decoding algorithms exist, but it is not always clear what the various trade-offs
are.
In 2017, the Consultative Committee for Space Data Systems (CCSDS) released a revi-
sion to the TC Synchronization and Channel Coding standard [1], adding the specifications
for a new short block-length Low Density Parity Check (LDPC) code. This code outperforms
the legacy BCH code in its error-correcting capability, with the potential to function at
much lower signal-to-noise ratios.
The CCSDS has also defined several LDPC codes for use on satellite telemetry (space-
to-earth) links. These codes have performance comparable to turbo codes, but at higher
rates. Encoding algorithms are compared and the best candidate is implemented for the
Iris Radio.
The Iris Radio is a DSN-compatible small satellite transceiver designed by JPL and now
produced by SDL. LDPC encoding and decoding is a desired feature that is not currently
supported by the radio. The best algorithms resulting from this study will be selected for
integration into the radio.
The contributions of this project include a novel flexible-width design for recursive
convolutional encoders, a simplified SNR estimation algorithm for the decoder, and a serial
decoder architecture for the CCSDS telecommand codes.

The initial objectives of this project were:

• Study state-of-the-art LDPC decoding algorithms and architectures.
• Optimize existing algorithms for use with the CCSDS Telecommand LDPC code.
• Produce a functional LDPC decoder hardware implementation for use on the Iris
Radio.
Since sufficient funding and time remained, the following objectives were added:
• Explore the design space for LDPC encoders, specifically relating to the CCSDS
telemetry codes.
Each algorithm and its associated architectures are evaluated based on these three
main criteria:
• Error-correcting performance. Some algorithm variations sacrifice coding gain for
reduced complexity. This is measured by generating bit error rate plots at various
signal-to-noise ratios.
• Speed. In general, higher speeds require higher FPGA utilization and may also sacri-
fice error-correcting performance. The speed constraint is also driven by the highest
uplink rate the Iris Radio supports, and is measured in clock cycles per bit per itera-
tion.
• FPGA utilization. An efficient algorithm and architecture uses fewer FPGA resources.
These criteria are interdependent. In general, one is improved at the expense of the
other two. Appropriate bounds are set on each criterion based on what the Iris Radio
supports, and then a trade study is performed to balance the trade-offs and select the best
design.
Figure 1.1 diagrams the methodology that was used to meet the project objectives. The
project started with an in-depth review of prior work and current state-of-the-art LDPC
encoding and decoding algorithms and architectures. Several architectures were selected
for prototyping in MATLAB, and then implemented on a Xilinx FPGA platform. These
implementations were evaluated for error-correcting performance, speed, and
FPGA utilization. Based on these metrics, one encoding implementation and one decoding
implementation were selected.
Low Density Parity Check codes were originally proposed by Gallager in 1962 [8]. They
showed promise, but were too computationally complex for the hardware of the time to be
of much use. As a result, they sat largely unnoticed until the mid 1990s and the work of
MacKay, who showed that LDPC codes can approach the Shannon limit of performance [9, 10].
The formulation of the iterative Belief Propagation (BP) decoding algorithm and the advent
of modern FPGA and VLSI technology have enabled practical LDPC encoding and decoding.
Improvements to the structure of the codes have also been made. The original codes
proposed by Gallager had regular weights (the number of 1s in each row and column was
constant) and were limited by poor error floor performance. Newer irregular LDPC codes
have since improved on this error floor behavior.
A Low Density Parity Check (LDPC) code is any error-correcting code with a sparse
(low density) parity check matrix, hence the name. The sparsity of the parity check matrix
facilitates efficient encoding and decoding. LDPC codes have been proven to approach the
channel capacity [8] and are relatively simple to encode and decode. These traits make the
LDPC family of codes some of the best known codes.
In order to reach capacity, however, the block sizes must be sufficiently large.
An LDPC code has no required structure, and may be completely randomly generated.
In fact, such a randomly generated code in general has good minimum distance properties,
but can be computationally burdensome to encode and decode due to the lack of structure.
Subclasses of LDPC codes have been proposed, such as accumulate-repeat codes, quasi-cyclic
protograph-based codes, and more. The additional structure imposed allows for more
efficient encoding and decoding.
There are two classes of LDPC codes: regular and irregular. Regular codes are described
by a parity check matrix which contains a constant number of ones (a constant
weight) in each column, and a constant weight in each row. Irregular codes do not have
that constraint, and so may have any number of ones in each row or column (while still
being “sparse”). Research has shown that irregular codes generally perform better than
regular codes.
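This distinction can be checked directly from a parity check matrix. A small sketch (the example matrices below are illustrative toys, not codes from the standard):

```python
import numpy as np

def classify_ldpc(H):
    """Return 'regular' if every row weight matches and every column weight matches."""
    H = np.asarray(H)
    row_weights = set(H.sum(axis=1))
    col_weights = set(H.sum(axis=0))
    return "regular" if len(row_weights) == 1 and len(col_weights) == 1 else "irregular"

H_regular = [[1, 1, 0, 0],
             [0, 1, 1, 0],
             [0, 0, 1, 1],
             [1, 0, 0, 1]]        # every row and column has weight 2
H_irregular = [[1, 1, 1, 0],
               [0, 1, 0, 1],
               [1, 0, 0, 1]]      # column weights vary between 1 and 2
```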
Since LDPC codes are quite unconstrained in structure, being any code with a sparse
parity check matrix, encoding is generally achieved by multiplication with the generator
matrix.
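Generator-matrix encoding can be sketched with a toy systematic code over GF(2); the matrices here are illustrative, not one of the CCSDS codes:

```python
import numpy as np

# Systematic generator matrix G = [I | P] with matching PCM H = [P^T | I].
P = np.array([[1, 1, 0],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]], dtype=np.uint8)
G = np.hstack([np.eye(4, dtype=np.uint8), P])
H = np.hstack([P.T, np.eye(3, dtype=np.uint8)])

def encode(u, G):
    """Multiply the message by the generator matrix, modulo 2."""
    return (np.asarray(u, dtype=np.uint8) @ G) % 2

u = np.array([1, 0, 1, 1], dtype=np.uint8)
c = encode(u, G)
assert np.all((H @ c) % 2 == 0)  # every codeword satisfies all parity checks
```

Because the code is systematic, the first k bits of c are the message itself, and only the parity portion uP must actually be computed.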
Decoding is performed with an iterative algorithm which uses posterior probabilities to
inform estimates about received bit values, which are then used to produce updated
posterior probabilities, and so on. These algorithms are nearly optimal and generally
converge within a few iterations.
Because LDPC codes are so loosely defined, a wide range of trade-offs exists for LDPC
codes. Other code families may only perform well at high rates, for example, whereas good
LDPC codes can be designed across a wide range of rates.
LDPC codes have been shown to approach the channel capacity for large block sizes [10].
They have performance comparable to that of turbo codes, and with the advantage of
generally fewer computations per bit to decode [11]. They are especially good at higher
rates [3], and thus complement the good performance of Turbo codes at low rates.
One of the limitations of LDPC code performance is the presence of an error floor
region [12]. As the SNR increases, the performance of LDPC codes increases exponentially
until it hits this error floor. The location and size of the error floor is a major factor in
LDPC code design. Modern irregular LDPC codes perform much better in this regard than
earlier regular codes.
A good LDPC code can have a large minimum distance, allowing a decoder to use the
codeword criterion as a stopping condition and to detect errors with a very low undetected
error rate.
1.4.3 Challenges
Encoding by direct multiplication with the generator matrix generally has a complexity of
O(n²). Sparseness of the parity check matrix is not a sufficient
condition for sparseness of the generator matrix, so the full complexity is usually required.
Since their discovery, some efficient algorithms have been proposed to achieve nearly linear
encoding complexity for most codes [13]. This difficulty has also been overcome by designing
codes with structured generator matrices. For example, some codes have a block-circulant
generator matrix that can be encoded with simple shift registers.
Decoding is an iterative process, so research continues into the most efficient decoding
algorithms and architectures.
Because LDPC codes perform better with larger block sizes, many codes have a large
parity check matrix (PCM). This can make it difficult to characterize some aspects of the
codes, since a brute force computer search (e.g. for minimum distance or codeword weights)
is computationally infeasible.
LDPC codes are an active research topic, and hundreds of papers are published each
year addressing code design, efficient decoding algorithms, error floor mitigation, and more.
There are several codes defined by the CCSDS for both telecommand (earth-to-space)
and telemetry (space-to-earth). Since the focus of this project is to implement a telecommand
decoder and telemetry encoder on a satellite, the parity check matrices (for decoding)
are the focus of the discussion on telecommand codes, and the generator matrices (for
encoding) are the focus for the telemetry codes.

1.5.1 Telecommand Codes
In 2017, the Consultative Committee for Space Data Systems (CCSDS) published an
update to the Recommended Standard for TC Synchronization and Channel Coding [1].
This update added the option to use short block-length LDPC codes to encode uplink
telecommands sent to spacecraft. The specification defines two systematic rate 1/2 codes,
with codeword lengths (n = 128, k = 64) and (n = 512, k = 256). These new LDPC codes
are intended to replace the legacy BCH codes which, while easy to encode and decode, have
much worse performance than modern codes. There are a handful of papers describing
decoder implementations for these codes.
The telecommand codes have short block lengths for LDPC codes, but they have a relatively
large minimum distance. Short blocks were chosen because satellite telecommands are
generally short and known in advance. In contrast, telemetry payloads generally require
a higher-throughput channel for downlink. Nevertheless, the telecommand codes achieve
significantly better performance than the legacy BCH codes. Figure 1.2 compares the word
error rates of several uplink coding schemes.
Fig. 1.2: Word Error Rates for Several CCSDS Uplink Coding Schemes (from [1]).
Both codes are constructed from a protograph [14]. A small Tanner graph called a
protograph is constructed as shown in Figure 1.3, where circles represent variable nodes,
squares represent check nodes, and edges represent the bi-directional message passing paths.
Fig. 1.3: Protograph for Short Block Length LDPC Codes (from [1]).
The protograph is duplicated M times (M = 16 for the shorter code and M = 64 for the
larger) to produce larger codes. Each bundle of M edges is cut, cyclically permuted, then
reattached. The result is a code with a block-circulant parity check matrix with row weight
8 and column weights of 3 and 5. The full PCM for the n = 128 code is shown in Figure
1.4, where a dot represents a non-zero entry at that location. The cyclic permutations were
selected so that both codes have 1s on the main diagonal of the left half of the PCM. This
structure simplifies encoder construction.
The underlying protograph structure is taken advantage of in the serial decoder (Section
3.4.3), and the structure of the cyclic permutations is taken advantage of in the partially-parallel
decoder.
Fig. 1.4: Parity Check Matrix for the (128,64) LDPC Code (from [1]).
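The copy-and-permute construction described above can be sketched in software. Each edge of the protograph becomes an M × M cyclically-shifted identity block in the expanded PCM. The toy protograph and shift values below are illustrative, not the CCSDS protograph (which also uses parallel edges that this sketch does not model):

```python
import numpy as np

def expand_protograph(B, shifts, M):
    """Expand a 0/1 protograph PCM: each edge becomes an M x M cyclic permutation.

    B[i][j] marks an edge between check i and variable j; shifts[i][j] selects
    the cyclic permutation applied to that bundle of M edges."""
    rows, cols = len(B), len(B[0])
    H = np.zeros((rows * M, cols * M), dtype=np.uint8)
    I = np.eye(M, dtype=np.uint8)
    for i in range(rows):
        for j in range(cols):
            if B[i][j]:
                H[i*M:(i+1)*M, j*M:(j+1)*M] = np.roll(I, shifts[i][j], axis=1)
    return H

B = [[1, 1, 0],             # toy protograph: 2 check nodes, 3 variable nodes
     [0, 1, 1]]
shifts = [[0, 3, 0],
          [0, 1, 2]]
H = expand_protograph(B, shifts, M=4)
```

The expanded matrix inherits the protograph's degree profile: each expanded row keeps the weight of its protograph row, which is how the CCSDS construction yields its fixed row and column weights.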
1.5.2 Telemetry Codes

In August 2011, the CCSDS standardized 10 systematic LDPC codes for satellite
telemetry [4, 11]. One is a high-rate R = 223/255 (approximately 7/8) code with a block
size of k = 7136. The others are a family of 9 codes with rates 1/2, 2/3, and 4/5, and
block sizes k=1024, 4096, and 16384. The various options of block size and code rate allow
engineers to manage trade-offs and select a code best matched to a particular mission.
A bit error rate comparison of CCSDS-defined telemetry codes is shown in Figure 1.5.
Note that Turbo and LDPC codes both perform close to capacity.
The advantage of the LDPC codes over Turbo codes is the higher code rate. This
means that more data can be transmitted with less overhead, suitable for higher data rates.
This of course comes at the cost of some coding gain, but this is a managed trade-off. Figure
1.6 shows the bit error rate in the symbol domain to accentuate the effect of the various
rates. Notice that the LDPC codes form a natural extension of the existing turbo codes,
and their bit error rate curves are spaced approximately 2 dB apart in the symbol domain.
The performance of these codes complements the existing Turbo codes standardized by
the CCSDS. The Turbo codes perform very well at low rates, whereas the LDPC codes are
defined at higher rates. This gives designers a larger solution space, depending on whether
a mission is power-limited or bandwidth-limited.
Figure 1.7 illustrates this idea. If a system is power-limited, the codes on the far left
will perform well. If the system is bandwidth-limited, the codes at the bottom left will
perform well.
The construction of these codes is not covered here [11], since it was not taken into
account in the design of the encoder. However, these codes do have systematic block-circulant
generator matrices, a structure which is exploited to reduce the encoder's complexity and
storage requirements.
Fig. 1.6: LDPC and Turbo Code Comparative Performance (from [3]).
The so-called “rate 7/8” code is actually a rate R = 223/255 code formed from a
basic (8176, 7156) code, which is then expurgated, shortened, and extended to form the
recommended (8160, 7136) code (see Figure 1.8). These sizes were chosen because the rate
and dimensions match those of the CCSDS-defined Reed-Solomon Interleave-4 code.
The basic code’s PCM is a 2×16 block array of 511×511 circulants. Each circulant has
a row and column weight of 2, resulting in the code having row weights of 32 and column
weights of 4.
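The weight arithmetic quoted above (16 blocks × weight 2 per row, 2 blocks × weight 2 per column) can be checked numerically on a scaled-down analogue, here with 8 × 8 circulants instead of 511 × 511 to keep the sketch small:

```python
import numpy as np

def circulant(first_row):
    """Build a circulant matrix: row s is the first row cyclically shifted by s."""
    m = len(first_row)
    return np.array([np.roll(first_row, s) for s in range(m)], dtype=np.uint8)

rng = np.random.default_rng(0)
m = 8
blocks = []
for i in range(2):                       # 2 block-rows
    row_blocks = []
    for j in range(16):                  # 16 block-columns
        first = np.zeros(m, dtype=np.uint8)
        first[rng.choice(m, size=2, replace=False)] = 1   # weight-2 first row
        row_blocks.append(circulant(first))
    blocks.append(row_blocks)
H = np.block(blocks)

assert np.all(H.sum(axis=1) == 32)       # 16 blocks x weight 2 per row
assert np.all(H.sum(axis=0) == 4)        # 2 blocks x weight 2 per column
```

A circulant's column weight always equals its row weight, which is why the block weights add up the same way along both dimensions.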
The generator matrix derived from this parity check matrix is what is used for the encoder
implementation in this project.
Fig. 1.7: Power Efficiency versus Spectral Efficiency for Several CCSDS Codes (from [2]).
The remaining CCSDS telemetry codes are a family of 9 AR4JA (Accumulate, Repeat-
by-4, Jagged Accumulate) codes [16, 17]. This family consists of every combination of
the block-lengths k = 1024, 4096, 16384 and code rates R = 1/2, 2/3, 4/5. The codes are
designed so that a constant block-length k can be used across several rates, allowing the
spacecraft's command and data handling system to maintain consistent data frame sizes
across code rates.
The specification provides the PCMs for each code, from which the generator matrices
can be derived. The parity check matrices and generator matrices for these codes are block-circulant,
and the last block-column of symbols is punctured. The power-of-two circulant
sizes simplify hardware implementation.
1.6 Iris Radio

The Iris radio is a small-satellite radio (about 0.5 U) that is compatible with NASA's
DSN. Historically, only large satellites have been able to push beyond low-earth orbit.
However, with the rising popularity of small satellites about the size of a shoe box, more
and more missions are pushing the boundaries of what can be achieved using such a small
form factor. The Iris Radio supports these types of missions by allowing the satellites to
communicate with the earth at large distances and relatively low signal-to-noise ratios.
Iris is a software-defined radio, meaning that most of the signal processing is performed
in the digital domain on an FPGA. This allows new features, such as new coding standards,
to be added without hardware changes.
1.6.1 History
Iris was originally designed by the Jet Propulsion Laboratory. It has been licensed to
the Space Dynamics Laboratory in Logan, Utah for fabrication, testing, and improvements.
Iris has flight heritage on the MarCO-A and MarCO-B satellites, famously the first
interplanetary CubeSats. This pair of communication relays used Iris radios to relay
information about the entry, descent, and landing of NASA's InSight Mars lander.
Over time, Iris has evolved as new features and capabilities have been added. Each
new addition allows the radio to function in new situations and work for new missions and
new scenarios.
1.6.2 LDPC Addition
In an effort to improve the functionality of the Iris radio, a trade study was performed
which enumerated the capabilities of NASA's Deep Space Network, then compared those
capabilities against the features currently supported by the radio.
It was determined that the recently-standardized LDPC codes were a strong candidate
for improvement, since they provide significant coding gains for telecommands and higher
coding rates for telemetry. This improvement is also an example of the advantages of a
software-defined radio.

In order to meet the requirements of a wide range of customers, the goal is to support
the entire family of standardized codes.
CHAPTER 2
ENCODER DESIGN
The encoder targets the family of 9 AR4JA telemetry codes, along with the rate 7/8
telemetry code. Since the implementation is for a satellite radio, encoding of the
telecommand codes is not required.

2.1 Overview of Encoding Algorithms and Architectures
Although an LDPC parity check matrix is sparse, the generator matrix is generally dense,
so the encoding complexity is quadratic in the message length k. The memory requirements
to store a dense generator matrix can also be very expensive. In the case of the r = 1/2,
k = 16384 telemetry code, storing the non-systematic part of the generator matrix would
require nearly 270 million bits! Until 2001, this was the accepted way of encoding an LDPC
code, and was one of the primary challenges for LDPC codes.
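The storage figure quoted above is easy to verify: the non-systematic part of G for the R = 1/2, k = 16384 code is a dense k × (n − k) binary matrix, whereas the block-circulant structure (discussed below) needs only the first row of each m × m circulant. A quick check, using the parameter values given in this chapter:

```python
k = 16384
n_minus_k = 16384            # R = 1/2, so n - k = k
dense_bits = k * n_minus_k   # full dense non-systematic generator matrix
print(dense_bits)            # 268,435,456 bits -- "nearly 270 million"

# Block-circulant storage: only the first row of each m x m circulant block.
m = 2048                     # cyclic block length for this code
b = n_minus_k // m           # 8 parity block-columns
block_rows = k // m          # 8 message block-rows
first_row_bits = block_rows * b * m
print(first_row_bits)        # 131,072 bits -- a factor-of-2048 reduction
```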
In 2001, Richardson and Urbanke showed that by manipulating the parity check matrix
encoding complexity can be nearly linear, with the quadratic part dependent only on a small
gap g [13]. The larger the gap, the more difficult it is to encode. Many families of codes
have a gap of 0, meaning that encoding is truly linear in complexity. The CCSDS codes
unfortunately have a gap of (n − k)/2 [18], meaning that the RU encoding method has
no advantage over a direct encoder for the CCSDS codes. Additionally, the RU encoder
introduces latency and is not as flexible, so supporting all ten telemetry codes would be
very difficult.
Fortunately, the CCSDS telemetry codes have an important property that aids encod-
ing: the generator matrices are block-circulant. Shift registers can be used to significantly
reduce the complexity of the operation and the size of the generator matrix to be stored.
The natural implementation shown in Figure 2.1 can encode messages in real time with
serial inputs. It requires storage of the first row of each circulant block of the generator
matrix, (n − k)/m shift registers of length m, 2(n − k) memory cells, and (n − k) binary
adders.
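The key idea behind these shift-register encoders, that only the first row of each circulant block needs to be stored, can be modeled in software. This sketch (toy dimensions, not a CCSDS matrix) computes the parity from the stored first rows alone and checks it against a dense matrix multiply:

```python
import numpy as np

def circulant(first_row):
    """Row s of a circulant is the first row cyclically shifted by s."""
    m = len(first_row)
    return np.array([np.roll(first_row, s) for s in range(m)], dtype=np.uint8)

def encode_parity(u, first_rows, m):
    """Accumulate parity using only the stored first rows.

    first_rows[r][j] is the first row of the m x m circulant in block-row r,
    block-column j of the non-systematic part of G."""
    b = len(first_rows[0])
    p = np.zeros(b * m, dtype=np.uint8)
    for i, bit in enumerate(u):
        if bit:
            r, s = divmod(i, m)              # which block-row, which cyclic shift
            for j in range(b):
                p[j*m:(j+1)*m] ^= np.roll(first_rows[r][j], s)
    return p

m, block_rows, b = 4, 2, 3
rng = np.random.default_rng(1)
first_rows = [[rng.integers(0, 2, m, dtype=np.uint8) for _ in range(b)]
              for _ in range(block_rows)]
P = np.block([[circulant(fr) for fr in row] for row in first_rows])
u = rng.integers(0, 2, block_rows * m, dtype=np.uint8)
assert np.array_equal(encode_parity(u, first_rows, m), (u @ P) % 2)
```

In hardware the `np.roll` of the stored row is what the shift registers (or, in the RCE variant, the shifting accumulator) provide for free.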
A slightly better approach than the conceptually-simple shift register is to use a re-
cursive convolutional encoder (RCE) [19], as shown in Figure 2.2. Instead of shifting the
generator matrix bits, the shifts are built into the accumulator. This way, the generator
matrix bits are stationary and can be replaced by combinatorial function generators or
block memory. This architecture requires only (n − k) bits of memory and (n − k) binary
adders.
The implementation for this project uses a modified RCE approach, in which the
RCE instances are designed to have a flexible recursion length (shown in Figure 2.2 as
m), allowing the entire family of CCSDS codes to use the same RCEs on the FPGA. This
length is configurable at run-time. Due to the number of supported codes, block RAM is
used to store the first row of each circulant block of the generator matrices. The RCEs are
additionally modified to use block RAM to store intermediate results, partially serializing
the encoding process.
2.2 Modifications to Traditional RCE

The RCE architecture described in [18, 19] was modified to increase flexibility and
decrease resource utilization. These changes allow the same physical encoder to support all
10 CCSDS telemetry codes, and allow the designer to control the trade-off between speed
and resource utilization.
2.2.1 Serialization
In order to decrease resource utilization, the RCE encoding process is partially serialized.
Intermediate results are stored in block RAM and recalled for later use. If this
intermediate result memory were not used, the largest code (k=16384, r=1/2) would require
16384 flip flops for the shift registers. With this modification, the number of flip flops
is greatly reduced: only f of the b RCEs are instantiated, where b is the number of parity
bit block-columns in the generator matrix. For example, the AR4JA codes have b = 8. In a
regular RCE implementation, all 8 cyclic block-columns would be processed in parallel. Here,
for each incoming message bit, the instantiated RCEs operate on the first f circulant block-column(s)
of the generator matrix. Then the f × m accumulated bits are written to working
memory, and the next f block-columns' saved bits are read from the same memory. These
bits are used with the same input message bit to perform computations for the next f × m
parity bits, then are written back to working memory. This procedure is repeated until all
of the parity bit computations have been performed for the input message bit.
This working memory architecture is shown in Figure 2.3. Once a set of binary multiply-
and-accumulate operations have finished, the results are written to the working memory and
the next set is loaded into the shift registers. This architecture allows the working memory
to be bypassed when encoding is finished in order to read out coded parity bits.
A fully-parallel implementation can perform the (n − k) computations for each input bit in
a single cycle. The serialized implementation takes ⌈b/f⌉ cycles to perform the same work.
This illustrates the trade-off between speed and resource utilization.
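The cycle count of the serialized schedule follows directly from the grouping of the b block-columns into batches of f. A one-line sketch with illustrative values:

```python
import math

def serialized_cycles(b, f):
    """Clock cycles of accumulation work per input message bit."""
    return math.ceil(b / f)

# AR4JA codes have b = 8 parity block-columns; the rate 7/8 code has b = 2.
assert serialized_cycles(8, 8) == 1   # fully parallel: all block-columns at once
assert serialized_cycles(8, 2) == 4   # f = 2 RCEs: four passes through working memory
assert serialized_cycles(2, 8) == 1   # rate 7/8 code finishes in one pass
```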
2.2.2 Flexible Recursion Length
The second modification allows all 10 CCSDS telemetry codes to use the same encoder
hardware. The traditional LDPC RCE encoder implementation uses b = (n − k)/m distinct
fixed-length shift registers. Instead, this implementation uses a single shift register with
the length (n − k) of the
largest supported code, then inserts optional recursion points at specific lengths. The result
is a runtime-configurable recursion length that can be modified to support any of the
ten codes.
An example flexible shift register is shown in Figure 2.4, where each square is a D
flip-flop, with multiplexers interspersed along the chain. Depending on which multiplexer
inputs are selected, this circuit can be four shift registers of length 4, two shift registers of
length 8, or one shift register of length 16. Therefore this shift register could be used in
a flexible RCE of length 16, and would create run-time configurable RCEs with lengths of
m=4, 8, or 16.
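The behavior of this mux chain can be modeled as one storage array whose cyclic feedback is segmented at a configurable interval m. A behavioral sketch of Figure 2.4 (software model, not HDL):

```python
def shift_segmented(state, m):
    """Cyclically shift each length-m segment of the register independently.

    With a 16-element register, m = 4 behaves as four length-4 shift registers,
    m = 8 as two length-8 registers, and m = 16 as one length-16 register."""
    assert len(state) % m == 0
    out = []
    for start in range(0, len(state), m):
        seg = state[start:start + m]
        out.extend([seg[-1]] + seg[:-1])   # rotate the segment by one position
    return out

state = list(range(16))
s = state
for _ in range(4):
    s = shift_segmented(s, 4)
assert s == state                # each length-4 segment returns home after 4 shifts
assert shift_segmented(state, 16) != state   # one shift of a length-16 register
```

Selecting m at run time corresponds to choosing which multiplexer inputs in Figure 2.4 are active.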
Since all cyclic block lengths, with the exception of the rate 7/8 code, are powers of two,
the multiplexers are placed at power-of-two intervals. In order to support the rate 7/8
code, additional multiplexers are inserted to enable a recursion length of 511.
2.3 Memory Organization

This section describes the memory organization used to create a low-resource encoder.
2.3.1 Generator Matrix ROM
BRAM memory was chosen for the matrix ROM for several reasons:
• Even just storing the first row of each block-circulant, the memory requirements are
about 540 kb. This is too much for distributed memory, and too complex for combinatorial
function generators.
• BRAM can have very wide data buses. We use memory with data widths of 512 to
2048 bits.
The ROM has a read width equal to the size of the flexible RCE, and as many addresses
as are needed to store the supported codes.
To store a code, the generator matrix is reduced to the first rows of each circulant block.
Columns corresponding to systematic and punctured bits are discarded. The remaining
columns are bit-reversed so that the column of the first parity bit is on the right-hand side,
allowing the parity bits to be read out of the RCE shift register when computations are
finished. These rows are then broken into chunks equal to the maximum size of the flexible
RCE.
For some of the shorter block-length codes, the number of parity bits (n − k) may be
less than the maximum size of the flexible RCE. In this case, the row is right-aligned in
the memory word.

2.3.2 Working Memory
The working memory must be capable of reading and writing large data widths each
cycle, so dual-port BRAM was chosen. This could have been implemented using distributed
RAM, but Iris has a shortage of free LUTs and plenty of unused BRAM.
The AR4JA codes have a maximum serialization factor of 8, so the working memory is sized
accordingly.
Table 2.1 breaks down the number of parity bits (n − k), cyclic length (m), and number
of circulant block-columns (b) for each supported code.

Code               n − k     m      b
R=1/2, k=1024       1024    128     8
R=2/3, k=1024        512     64     8
R=4/5, k=1024        256     32     8
R=1/2, k=4096       4096    512     8
R=2/3, k=4096       2048    256     8
R=4/5, k=4096       1024    128     8
R=1/2, k=16384     16384   2048     8
R=2/3, k=16384      8192   1024     8
R=4/5, k=16384      4096    512     8
R=7/8, k=7136       1022    511     2
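These parameters satisfy n − k = m · b for every code, which makes a quick consistency check possible (values transcribed from Table 2.1, with the n − k entry of the R=1/2, k=16384 code taken as 16384 = 2048 × 8):

```python
# (n - k, m, b) per code
codes = {
    "R=1/2, k=1024":  (1024, 128, 8),
    "R=2/3, k=1024":  (512, 64, 8),
    "R=4/5, k=1024":  (256, 32, 8),
    "R=1/2, k=4096":  (4096, 512, 8),
    "R=2/3, k=4096":  (2048, 256, 8),
    "R=4/5, k=4096":  (1024, 128, 8),
    "R=1/2, k=16384": (16384, 2048, 8),
    "R=2/3, k=16384": (8192, 1024, 8),
    "R=4/5, k=16384": (4096, 512, 8),
    "R=7/8, k=7136":  (1022, 511, 2),
}
for name, (nk, m, b) in codes.items():
    assert nk == m * b, name   # parity bits = cyclic length x block-columns
```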
2.4 Code Support

The original goal was to support all ten CCSDS telemetry codes. In order to keep
resource utilization low, for this implementation a flexible RCE of length 512 was created,
which enables support for the k=1024, 4096 codes, and the rate 7/8 code. In order to
support the larger block length codes, an RCE of at least length 2048 would have to be
used, which uses at least four times the resources. The generator matrices for the k=16384
codes are also rather large, and require many more BRAMs to store. Storing only the
k=16384 codes uses 28 BRAMs, whereas all the other codes together, including the rate
7/8 code, use just 8 BRAMs. Since the performance between the k = 4096 and k = 16384
codes is very similar, the decision was made to drop support for the k = 16384 codes for
the present.
The AR4JA codes have power-of-two cyclic block lengths m, and so are easy to im-
plement in the flexible RCE. They also have 4m punctured bits. These punctured bits are
simply not calculated, and the associated columns of the generator matrix are not stored.
The rate 7/8 codeword starts with 18 virtual zeros that are encoded, but not transmit-
ted. These virtual zeros are handled by simply starting the cycle counter at 18 instead of
0 for the rate 7/8 case. After encoding is finished, the encoder state machine outputs two
additional bits to complete the extended codeword.
CHAPTER 3
DECODER DESIGN
Many LDPC decoding algorithms exist, and each algorithm has variations and param-
eters that can be used to decrease complexity or increase performance. When implemented
in hardware, decoding algorithms may have several feasible high-level architectures, each
with its own trade-offs [20]. Within each architecture, design choices such as the number
of bits used for soft decoding or the implementation of a particular function create an even
larger design space.
Some studies have been performed surveying the various LDPC decoding methods [20],
others describe methods of reducing complexity [21,22], and many resources exist describing
existing decoding algorithms [23, 24]. However, these resources do not optimize for specific
codes, but rather describe generalized algorithms. This chapter applies the results of this
prior work to the CCSDS telecommand codes.
3.1 Overview of Decoding Algorithms

Modern LDPC decoding algorithms are generally based on the belief propagation (BP)
algorithm. The decoder uses the received soft symbols to calculate probabilities or
log-likelihood ratios for each bit, then uses those
probabilities to update parity check information, then uses that parity check information to
update the bit probabilities, and so on. Eventually the algorithm converges on the closest
codeword.
This decoding process can be visualized at a high level using a Tanner graph, as shown
in Figure 3.1 with an example PCM. Each column of the PCM corresponds to a variable
(bit) node, and each row corresponds to a check node. The 1s in the PCM represent edges
between the corresponding variable and check nodes.
The variable nodes are initialized using the channel inputs. Each check node calculates
Fig. 3.1: Tanner Graph Representation of an Example Parity Check Matrix (from [5])
the bit probabilities for each of the variable nodes connected to it. The variable nodes use
this updated information to produce new soft bit estimates. This process continues, with
information “sloshing” back and forth between variable and check nodes until a stopping
criterion is met. For LDPC codes, the stopping criterion is usually the codeword criterion
(i.e. once a correct codeword is produced, decoding stops), but other stopping criteria have
been proposed.
The details and derivation of the decoding process have been covered extensively [5],
and will not be related in their entirety here. One part of the decoding process that stands
out is the Check Node Update (CNU) rule used. The CNU governs how input messages
from the variable nodes are translated into output messages of the check nodes.
The optimal CNU in a cycle-free graph uses an operation known as the "tanh rule",
since it involves computations of the hyperbolic tangent and its inverse. This expensive
operation combines all of the check node inputs except one, once for each of the outputs.
When the tanh rule is used, the decoder is referred to as a sum product (SP)
decoder.

A widely used simplification is the min-sum (MS) decoder. This decoder uses a CNU algorithm in which the smallest input magnitude (disregarding the
input to which the output belongs) is used as the output message, with its sign modified
based on the product of all the inputs. This decoder is much easier to implement, but comes
at the cost of a fraction of a dB loss in coding gain. The CNU implementation is discussed
in Section 3.2.
The Variable Node Update (VNU) is generally the same for each variation of the belief
propagation algorithm. The output messages each consist of a simple leave-one-out sum of the input messages and the received channel LLR.
A further simplified decoding algorithm is the bit-flipping algorithm [26]. This decoder
uses hard symbols to achieve very high throughputs at the expense of performance. Because
performance is important and high throughputs are not required for this application, a bit-flipping decoder was not considered further.

Each of this project's decoder implementations uses the min-sum (MS) algorithm, selected for its speed and implementation simplicity. The MS check node update is

L_{c→v} = ( ∏_{v′ ∈ V(c)−{v}} sign(L_{v′→c}) ) · min_{v′ ∈ V(c)−{v}} |L_{v′→c}|    (3.1)

This has two separable parts: the sign and the magnitude.
The output Lc→v message sign is the product of all the input messages’ signs, excluding
one. Since each message is represented as a two’s complement binary word, the sign bits of
the corresponding messages can be simply XORed together to produce the output sign.
The target code has row weights of eight. Performing an XOR across the seven other
sign bits for each of the eight output messages would require 48 operations. Instead, a single XOR is performed
across all eight input signs to find the total sign product (seven operations), then at each output this product is
XORed once more with the corresponding input's sign, resulting in just 15 operations (see
Figure 3.2).
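As a sketch, the two-pass sign computation can be expressed as follows (a sign bit of 1 denotes a negative message, as in two's complement):

```python
# Two-pass sign computation for one check node of degree eight
# (a sketch; a sign bit of 1 denotes a negative message).
def cnu_signs(sign_bits):
    # Pass 1: XOR all eight input sign bits together (seven operations).
    total = 0
    for s in sign_bits:
        total ^= s
    # Pass 2: XOR each input's own sign back out (eight operations),
    # giving the leave-one-out sign product for every output.
    return [total ^ s for s in sign_bits]
```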
Careful analysis of equation (3.1) shows that for a set of given input messages, the
output magnitude is either the minimum or sub-minimum of the input magnitudes. Rather than
performing a separate minimum search for each of the eight outputs, a single combined minimum/sub-minimum search can be shared by all of them.

With only eight inputs, this search can be performed in a tree-like fashion with three
layers as in Figure 3.3. This allows the entire CNU to operate in a single clock cycle.
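A software sketch of the three-layer minimum/sub-minimum tree (each merge step keeps the two smallest values of its subtree):

```python
# Tree search for the minimum and sub-minimum of eight magnitudes
# (a sketch of the three-layer structure).
def merge(a, b):
    # a and b are (min, submin) pairs; keep the two smallest of the four.
    m = sorted([a[0], a[1], b[0], b[1]])
    return (m[0], m[1])

def min_submin(mags):
    # Layer 1: pair up the eight inputs.
    pairs = [(min(x, y), max(x, y)) for x, y in zip(mags[0::2], mags[1::2])]
    # Layers 2 and 3: merge the pairs.
    left = merge(pairs[0], pairs[1])
    right = merge(pairs[2], pairs[3])
    return merge(left, right)
```

Each output magnitude is then the sub-minimum when its own input was the minimum, and the minimum otherwise.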
The variable node update (VNU) is the same for both the MS and SP algorithm. The
update is described by
L_{v→c} = L_v + Σ_{c′ ∈ C(v)−{c}} L_{c′→v}    (3.2)
Each output message is simply the addition of all the input messages (excluding the
one from the destination check node) and the received channel LLR. Because addition is a
reversible operation (an added number can later be subtracted without loss of information)
this can be efficiently computed using an adder tree as in Figure 3.4. The target code has
column weights of three and five, so the tree needs at most three levels.
First, all the inputs, including the channel input, are summed using the adder tree.
Then, at each of the outputs the corresponding input message is subtracted from the total.
In order to avoid overflow, each level of the tree adds one bit of word growth. The final
results are truncated with saturation back to 4 bits. The effect of the finite word size on performance is examined in the decoder testing results.
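The shared-sum structure with saturation can be sketched as:

```python
# Variable node update as one shared sum followed by per-output
# subtraction, with 4-bit saturation (a sketch).
def saturate4(x):
    # Truncate with saturation to the 4-bit two's complement range.
    return max(-8, min(7, x))

def vnu(channel_llr, check_msgs):
    # One sum over all inputs (the adder tree in hardware)...
    total = channel_llr + sum(check_msgs)
    # ...then each output subtracts its own input message back out.
    return [saturate4(total - m) for m in check_msgs]
```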
3.4 Architectures
Several implementation architectures exist for each decoding algorithm. These archi-
tectures generally have similar performance, but trade between speed and resource usage.
For this project, three architectures were chosen to implement the min-sum decoding
algorithm. First is a fully parallel decoder (Section 3.4.1) which is extremely fast, but also
extremely resource intensive. Next is a partially parallel architecture (Section 3.4.2) which
balances throughput against resource usage. Last is a fully serial architecture (Section 3.4.3) which performs all computations serially, but uses the fewest resources.
3.4.1 Fully Parallel

The fully parallel decoder is the simplest conceptually. This decoder instantiates all
check nodes and all variables nodes in hardware, along with all the interconnections be-
tween them. Figure 3.5 shows how each variable node and check node are arranged. The
interconnections between them follow the PCM for the code, and the channel inputs are
used to initialize the variable nodes. Every VNU is computed simultaneously, followed by
every CNU. This architecture is the most expensive resource-wise, but is also the fastest.
The fully parallel decoder follows a flooding schedule. First, the variable nodes are
initialized with the received LLR values. Then, each parity check node runs simultaneously,
sending updated messages back to the variable nodes. Next, each variable node runs simul-
taneously, sending updated messages back to the check nodes. Both check node and variable
node updates take a single cycle each, so a full decoding iteration can be accomplished in
just two clock cycles. Compare this to a general CPU software implementation, in which each node update must be computed sequentially.
This speed comes at the expense of high resource usage. Each check and variable node
takes up space in the FPGA fabric. Additionally, the routing between them is nontrivial.
As the size of the parity check matrix grows, the interconnect routing quickly increases to
become a congested mass of wires with large critical paths and large power dissipation.
Because of this, a fully parallel decoder is not suitable for large codes. Fortunately, for the
CCSDS short block length code (n = 128, k = 64), a fully parallel implementation fits in
the FPGA. The larger (n = 512, k = 256) code was not implemented.
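As a software sketch of the flooding schedule (pure Python on a tiny hypothetical code; the hardware performs every node update of each half-iteration in a single clock cycle):

```python
# Flooding-schedule min-sum decoding on a tiny hypothetical code
# (software sketch, not the hardware implementation).
import math

def decode(H, llr, max_iters=25):
    m, n = len(H), len(H[0])
    edges = [(c, v) for c in range(m) for v in range(n) if H[c][v]]
    msg_cv = {e: 0.0 for e in edges}  # check-to-variable messages
    bits = [1 if x < 0 else 0 for x in llr]
    for _ in range(max_iters):
        # Variable node half-iteration: channel LLR plus a leave-one-out
        # sum of incoming check messages (with msg_cv all zero this
        # reduces to the initialization step).
        msg_vc = {(c, v): llr[v] + sum(msg_cv[(c2, v2)]
                                       for c2, v2 in edges
                                       if v2 == v and c2 != c)
                  for c, v in edges}
        # Check node half-iteration (min-sum): leave-one-out sign
        # product and minimum magnitude.
        for c, v in edges:
            others = [msg_vc[(c2, v2)] for c2, v2 in edges
                      if c2 == c and v2 != v]
            sign = math.prod(1 if x >= 0 else -1 for x in others)
            msg_cv[(c, v)] = sign * min(abs(x) for x in others)
        # Posterior hard decisions and the codeword stopping criterion.
        post = [llr[v] + sum(msg_cv[(c2, v2)]
                             for c2, v2 in edges if v2 == v)
                for v in range(n)]
        bits = [1 if x < 0 else 0 for x in post]
        if all(sum(bits[v] for v in range(n) if H[c][v]) % 2 == 0
               for c in range(m)):
            break
    return bits
```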
One big implementation advantage of the fully parallel decoder is the simplicity of
memory organization. For the other two architectures implemented, memory management
takes up a significant fraction of the resource utilization. In the fully-parallel case, no extra
hardware is required to manage the memory, since each node contains just a single register for each of its messages.
The fully parallel implementation used the min-sum CNU algorithm. One big downside
of having so many check nodes instantiated is that a single change to the check node update
computation has an enormous impact on resource utilization. For example, switching from
the MS algorithm to the SP algorithm would incur the difference in resource utilization once per instantiated check node.

Another limitation is that the fully parallel decoder is hard-wired to a single
parity check matrix. The node interconnects can be re-routed by reprogramming the FPGA,
but at run-time they are fixed, with no flexibility. If a family of codes shares parts of
the same PCM this can be exploited, but the CCSDS telecommand codes do not share any structure.
3.4.2 Partially Parallel

The partially-parallel decoder takes the middle ground of the three architectures. Like
the fully-parallel architecture, several check nodes and variable nodes are instantiated in
parallel. However, only a chosen subset of the total number is instantiated. This gives the
designer more control over the speed/resource usage trade-off, since the number of instantiated nodes can be chosen freely.

This implementation closely followed the one described in [6]. Figures 3.6 and 3.7 show
the variable node and check node update paths, respectively. Messages are stored in block RAMs between updates.
This architecture follows a different schedule than the other two. Rather than updating
the check nodes, then updating the variable nodes separately afterwards, check nodes and
variable nodes are updated simultaneously, block by block. This is possible because of
the intermediate message storage between updates. The result is that some of the check
node updates’ messages are newly updated, while others are not. While not ideal, this
allows check node and variable node updates to happen together, resulting in a 2x speedup.
The main challenge of the partially-parallel decoder is the memory access architecture.
In the case of the CCSDS codes, the parity check matrices are block-circulant, and this can
be taken advantage of in the memory architecture. For this reason, this type of architecture
is generally only used for codes with block-circulant parity check matrices, otherwise the
complexity of the memory accesses mitigates any benefit of the parallelism. For example,
if we let the number of instantiated nodes equal the size of a circulant sub-matrix of the
PCM, all the associated messages can be read in blocks of 16 (or whatever the circulant size
is). Block RAMs are used to efficiently store message results and provide wide data buses.
This memory architecture could support multiple block-cyclic codes at runtime. Only the read and write addresses and the interleaver barrel shifter
settings change between codes. In this project, however, only the (n = 128, k = 64) code was
implemented, with 16 VNUs and 8 CNUs (since there are half as many CNU computations
to perform). The resulting decoder can perform a full iteration in just n/16 = 8 cycles.
The architecture described in [6] was designed for block-circulant PCMs where each
circulant submatrix is a permutation matrix (has a column and row weight of one). The CCSDS telecommand PCMs do not strictly satisfy this property, so the memory access scheme required modification.
3.4.3 Fully Serial

The fully serial decoder is on the opposite end of the spectrum from the fully parallel
decoder. It instantiates just a single check node and a single variable node. This decoder
also follows a flooding schedule, but in this case each update is computed sequentially
in time. First, each variable node is initialized with the received LLRs, one at a time.
Next, each parity check node runs sequentially, reading the required variable node messages
from memory, then writing its outputs back into memory. Then, each variable node runs
sequentially, reading check node messages from memory and writing output messages back
into memory. Two memory banks are needed: one for check node messages and one for variable node messages.

This implementation was created specifically for the CCSDS code. The lessons
learned from implementing and modifying the partially parallel decoder were used to design
the memory architecture. The first implementation used a serial memory architecture which
was slow and had a high overhead. The second implementation documented here is much
improved and runs several times faster, while requiring significantly fewer resources.
Figure 3.8 shows an overview of the architecture. The major design challenge of the
fully serial decoder, like the partially parallel one, is in the memory architecture. How does
a check node know which variable node messages to use as input, and how does it know where to write its output messages?
The implemented decoder takes advantage of the protograph structure of the parity
check matrix (see Figure 1.3). The variable node memory bank has 8 BRAM memories for
the 8 variable nodes in the protograph, and the check node memory bank has 4 BRAM
memories for the 4 check nodes in the protograph. These BRAMs have read port widths
sufficient to read all the messages required for a single node update simultaneously, and
write port widths to write a single message at a time. For the CCSDS code and 4-bit
messages, the VMEM (variable node memory) BRAMs have read ports of 16 bits (four
messages) and write ports of 4 bits. The CMEM (check node memory) BRAMs have read ports of 32 bits (eight messages) and write ports of 4 bits.
Variable and check node updates are pipelined. Each message update requires three
cycles, but the pipelined throughput is one update per cycle. For example, a variable node update proceeds as follows:

1. In the first cycle, all the required input messages Lc′→v , where c′ ∈ C(v), are read from the VMEM bank.
2. In the second cycle, the chosen BRAM output is selected using a multiplexer. In this
implementation, the 16-bit data contains four 4-bit messages, which are input to the
VNU, along with the channel LLR input. The VNU performs its update as described
in Section 3.3.
3. In the third cycle, the VNU outputs Lv→c′ . Each of the messages is written to the appropriate CMEM.
The pipelining means that while messages from variable node v are being written into
CMEM, the VNU is operating on messages for variable node v+1, and messages for v+2 are
being read from VMEM. The operations for a check node update follow those of a variable
node update.
The memory organization is illustrated with the example PCM of Figure 3.9. Filled squares represent non-zero entries of the PCM. This matrix is smaller than the
CCSDS-defined PCMs, but suffices to demonstrate the memory organization. Each
check-node memory block (labeled CMEM X ) holds one block-row of the PCM, and each
variable-node memory block (labeled VMEM X ) holds one block-column of the PCM. Each
read address of a VMEM block contains a single column's worth of messages: all the check-to-variable messages for that variable node.
For example, while updating variable node v = 10, the highlighted column in Figure
3.9 is read from address v mod 8 = 2 of VMEM ⌊v/8⌋ = 1. The four messages contained
therein are input to the VNU. The four VNU output messages are then written to address
v = 10 of each of the CMEMs. Each CMEM is aware of the cyclic shifts of the submatrices
it contains, and internally adjusts its write address using simple modulo-8 addition prior to
writing.
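The address computations above reduce to simple modular arithmetic; a sketch for the 8×8 circulant blocks of the example matrix (the shift value below is illustrative):

```python
# Address arithmetic for the serial decoder's memory banks, using the
# 8x8 circulant block size of the example PCM (shift values are
# illustrative).
M = 8  # circulant submatrix size of the example matrix

def vmem_read(v):
    # Variable node v reads one column's worth of messages from a
    # single VMEM bank at a single address.
    return v // M, v % M   # (bank, read address)

def cmem_write_addr(base_addr, shift):
    # Each CMEM compensates for the cyclic shift of the submatrix it
    # holds with a simple modulo-M adjustment of the write address.
    return (base_addr + shift) % M
```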
A check node update proceeds similarly: while updating check node c = 1, the corresponding row of messages is read from the CMEMs. The eight messages contained therein are input to the CNU. The eight CNU output messages
are then written to address c = 1 of each of the VMEMs. Each VMEM is aware of the
cyclic shifts of the submatrices it contains, and internally adjusts its write address using modulo-8 addition prior to writing.
The actual telecommand PCMs defined by the CCSDS are larger, and have extra
properties. Some of the cyclic blocks are all zeros, and there is an extra diagonal of ones along the identity in some of the blocks.
To accommodate the occasional zero sub-matrix, each input to the variable node update
module can be zeroed out using a multiplexer. To accommodate the extra ones on the
identity, an extra BRAM is used. Since the ones are on the identity, their read and write
addresses are the same for both the variable node and the check node updates. And since
each update is performed sequentially, the same BRAM can be used for the check node and
variable node messages (each is overwritten by the other). The output messages of this
identity memory are multiplexed into the check node update inputs when appropriate.
The variable node hardware accommodates five input messages. When an update
requiring only three messages is computed, the extra two inputs are zeroed out by the multiplexers.

The fully-serial decoder has many advantages. First, it requires very few resources.
There is the small overhead of the memory access architecture, but only a single CNU and a single VNU are instantiated.
The decoder supports both CCSDS telecommand codes with very little modification.
Both the (n = 128, k = 64) and (n = 512, k = 256) codes have the same structure. The
only difference is the size of the submatrices (m = 16 for the shorter code, and m = 64 for
the longer) and the cyclic permutations of each of the submatrices. Since the BRAMs hold
only the messages (and not the blank spaces in the sparse PCM), the port widths are the
same. Both sets of permutations are stored in the CMEM and VMEM.
Another advantage of the fully-serial architecture is that the variable node or check
node update designs have much less effect on the overall resource utilization. In the fully-
parallel case, any increased cost for the check node update was multiplied n − k times. For
the fully-serial decoder, the increase is only incurred once. This allows for a better check
node design and better performance without greatly increasing resource usage.
The obvious drawback to a serial decoder is the slower speed. Even operating at a
single cycle per update, it still requires at least 2n − k cycles to complete a single iteration.
Because the CCSDS codes have short block lengths, the serial decoder is still feasible.
The decoder requires LLRs computed from the soft channel inputs. The received signal level for bit i is y_i = α b_i + z_i , where α is the signal
amplitude (|α| = √Ec ), b_i is either +1 or −1 based on the transmitted bit value, and z_i is
AWGN noise with power σ² = N0/2. The associated LLR [5] is calculated by

L_i = (2α/σ²) y_i    (3.3)
One simple and effective way of estimating the SNR is by comparing received symbols to
an expected pilot sequence. The CCSDS Communications Link Transmission Unit (CLTU)
has a 64-bit start sequence when used with LDPC coding [1]. If we assume we have frame
synchronization, the known start-sequence bits b_i give the amplitude estimate

α̂ = (1/64) Σ_{i=1}^{64} b_i y_i = (1/64) Σ_{i=1}^{64} (α b_i² + b_i z_i) = α + (1/64) Σ_{i=1}^{64} b_i z_i ≈ α    (3.4)
because b_i² = 1 and the last term is approximately zero due to the effect of averaging the
noise.
The noise variance can also be estimated by observing that z_i ≈ y_i − α̂ b_i , so that

σ̂² ≈ (1/64) Σ_{i=1}^{64} z_i² ≈ (1/64) Σ_{i=1}^{64} (y_i − α̂ b_i)²    (3.5)
One of the difficulties with using this approach in a streaming architecture is the
frame synchronization requirement. By the time the pilot symbols are known to have been
received, the codeword is being received. This means there must be a delay while SNR
estimation is performed, and memory to store the pilot symbols for later use.
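The pilot-based estimates of equations (3.4) and (3.5) can be sketched as follows (assuming frame synchronization, so the pilot bits are known):

```python
# Pilot-based estimates of equations (3.4) and (3.5), assuming frame
# synchronization so the 64 pilot bits b_i (+/-1) are known (a sketch).
def estimate_pilot(pilot_bits, y):
    n = len(y)
    # Amplitude: correlate the received samples with the known pilots.
    alpha_hat = sum(b * yi for b, yi in zip(pilot_bits, y)) / n
    # Noise variance: average squared residual after removing the signal.
    sigma2_hat = sum((yi - alpha_hat * b) ** 2
                     for b, yi in zip(pilot_bits, y)) / n
    return alpha_hat, sigma2_hat
```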
The calculations of equations (3.4) and (3.5) can be simplified by assuming that no
received symbols have enough noise to cross the zero decision threshold. This assumption
is valid only at high SNRs, but analysis will show that it is good enough for this situation.
Under this assumption,

b_i y_i ≈ |y_i|    (3.6)

so the amplitude estimate becomes

α̃ = (1/64) Σ_{i=1}^{64} |y_i|    (3.7)
which can be computed using a simple absolute value circuit and an accumulate-and-dump filter. The noise variance estimate similarly becomes
σ̃² = (1/64) Σ_{i=1}^{64} (y_i − α̃ b_i)² = (1/64) Σ_{i=1}^{64} (y_i² − 2α̃ y_i b_i + α̃² b_i²)
    = (1/64) Σ_{i=1}^{64} y_i² − 2α̃ · (1/64) Σ_{i=1}^{64} |y_i| + α̃² = (1/64) Σ_{i=1}^{64} y_i² − α̃²    (3.8)
which can be obtained from a multiplier, an accumulate-and-dump filter, and a subtraction. These
combined simplifications yield a very efficient SNR estimator that works at higher
SNRs, shown in Figure 3.10. In addition, the estimator no longer requires frame synchronization.
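A software sketch of the simplified estimator of equations (3.7) and (3.8):

```python
# Simplified high-SNR estimator of equations (3.7) and (3.8): one
# absolute-value accumulator and one squared-value accumulator, with
# no frame synchronization or pilot knowledge required (a sketch).
def estimate_simple(y):
    n = len(y)
    alpha = sum(abs(yi) for yi in y) / n                 # eq. (3.7)
    sigma2 = sum(yi * yi for yi in y) / n - alpha ** 2   # eq. (3.8)
    return alpha, sigma2
```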
But what about the low SNR case? The CCSDS telecommand codes have a threshold of
approximately 3.5 dB. Under simulation, the average estimate error and standard deviation
of the estimate error are plotted for various values of Eb/N0 in Figure 3.11.
We can see that the estimator begins to break down at about 5 dB; below 0 dB
it has a significant bias in the error. How much does the bias at 3.5 dB and above affect
the decoder? Bit error rate curves were generated by simulation in MATLAB, including:

• A worst-case estimate using the simplified algorithm described above, with the SNR estimate fixed at its worst-case error.
There is a small fraction of a dB performance loss using this estimator. However, this
is the worst-case estimate, meaning that the estimator will always perform at least as well
as indicated in the figure. This shows that the simplified estimator is sufficient for this code.
CHAPTER 4
This chapter discusses the testing methods and results for both the telemetry encoder and the telecommand decoder.
The encoder performance is measured using the metrics of correctness, speed, and size.
In order to test correctness, the encoder was first tested with HDL simulation using
Modelsim. After basic functionality was demonstrated with a sample message block of
alternating ones and zeros, a full suite of tests was designed to demonstrate full functionality.
The simulation testing suite consisted of ten tests for each supported code. Each test
consisted of a single codeword, designed for easy debugging and thorough testing of edge cases:

• All ones
• All zeros
• Alternating 0s and 1s
• Alternating 1s and 0s
Once the correct encoding of messages was shown, the encoder was integrated into the
Iris Transponder. A full-scale downlink test was then performed for each of the supported
codes at various data rates. Each test consisted of tens of thousands of random
codewords encoded, modulated, and transmitted, then received, demodulated, and decoded.
The tests operated at a high SNR and the decoder was monitored for the number of bit
errors in the messages. The pass criterion was that no bit errors be present before decoding, indicating that each message block was properly encoded. These tests were repeated
extensively at several data rates for each of the seven supported codes.
4.1.2 Utilization
The encoder utilization depends on the supported codes and the degree of serialization
as described in Section 2.2.1. The final configuration used for Iris supported all the 1024
and 4096 block-length codes, as well as the 7136-bit rate 7/8 code. The flexible RCE had a length of 512 bits.

In order to support all the length 16384 codes, the RCE length would have to be at least four times larger.
4.1.3 Speed
The final implementation used a flexible RCE of length 512. Because each code has a
different block-cyclic length, the degree of serialization required is different for each code.
The degree of serialization controls the maximum speed for a given clock rate.
The codes with fewer than 512 parity bits fit inside the 512 RCE bits, and so require
a single clock cycle per input bit. The longest code has eight blocks of 512 parity bits, and
so requires eight clock cycles per input bit. Given the Iris system clock frequency of 50
MHz, and the required number of clock cycles for the state machine to operate correctly,
the maximum operating speed in coded Mbps is also listed in Table 4.2. At the time of this
writing, the Iris transponder has an output symbol rate limit of 6.25 Msps, so each of the
codes performs at the highest Iris speeds, except for the R=1/2, k=4096 code.
Due to the implementation of the controlling state machine, the smallest number of
cycles between coded bits is 5, resulting in the maximum 10 Mbps speeds listed.
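The arithmetic behind the listed speeds can be checked directly (the 50 MHz clock, 5-cycle minimum, and 8-cycle worst case are the figures given above):

```python
# Maximum coded bit rate from the 50 MHz system clock and the number
# of clock cycles the encoder needs per coded bit.
CLOCK_HZ = 50e6

def max_coded_mbps(cycles_per_bit):
    return CLOCK_HZ / cycles_per_bit / 1e6

# State-machine minimum of 5 cycles per bit, and eight RCE passes per
# input bit for the longest supported code.
rate_min_cycles = max_coded_mbps(5)   # 10.0 Mbps
rate_longest = max_coded_mbps(8)      # 6.25 Mbps
```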
4.1.4 Trade-offs
The only variable parameters for the encoder are the speed and utilization. The encoding speed could be increased linearly by increasing the size of the RCE used (doubling the RCE size would roughly double the speed, at a corresponding resource cost).
4.1.5 Discussion
The final result is an encoder supporting seven of the ten LDPC telemetry codes at
the maximum Iris rate, while using less than one percent of the FPGA resources.
The flexibility of the design means that future improvements, such as supporting the
k=16384 codes, or increasing the maximum speed, are nearly trivial to implement and come at a modest resource cost.

Improvements to the controlling state machine could also be made to add some additional throughput.
The metrics of the decoder implementations are correctness, performance, speed, and
size. These attributes are compared across the three implemented architectures in the
following sections.
The decoder was first tested in simulation for correctness. Python was used to generate
100 random codewords at a configurable SNR. These codewords were then decoded in
the simulation. At high SNRs, the decoders were verified to detect correct codewords
immediately. At lower SNRs, the expected distribution of iteration counts was observed.
Next, a hardware testbed was created. The ML605 Virtex6 development board was
used, since it has an FPGA of the same family as the Iris radio. An interface was created to
allow a host computer to transfer generated LLR inputs to the decoder over ethernet. The
decoded results are transferred back to the host computer using the same ethernet interface.
This hardware testbed was used to generate bit error rate curves and utilization measurements.
The decoder implementation must truncate LLR messages to a finite bit width. The
choice of bit width directly affects the decoder’s performance [30]. In addition to the bit
width of the messages, the number of fractional bits included in that width also affects the
performance.
In order to characterize the effects of finite word size, the fully-parallel decoder was
tested at various bit widths and fractional lengths and BER curves were generated for each
version, shown in Figure 4.1. The notation used is X.Y, where X refers to the number of
integer bits (including the sign) in the two's complement representation, and Y refers to the number of fractional
bits included. For example, an LLR using a 5.1 representation can represent values from
−16 to 15.5, with an LSB resolution of 0.5. Increasing the fractional size to 5.2 represents values from −16 to 15.75, with an LSB resolution of 0.25.
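Under this reading of the X.Y notation, the quantization can be sketched as:

```python
# X.Y saturating fixed-point quantization of an LLR (a sketch):
# X integer bits (including the sign) plus Y fractional bits.
def quantize(llr, int_bits, frac_bits):
    scale = 1 << frac_bits
    lo = -(1 << (int_bits - 1)) * scale      # most negative code
    hi = (1 << (int_bits - 1)) * scale - 1   # most positive code
    code = max(lo, min(hi, int(round(llr * scale))))
    return code / scale                      # value actually represented
```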
The two MATLAB reference series in Figure 4.1 used floating point arithmetic. The
best performing finite representation is 5.1, which performs nearly as well as the floating
point representation. The next best is 4.0, which suffers an approximately 0.1 dB loss
compared to the floating point MS reference. As a result of this study, 4.0 was chosen as the word size for the implemented decoders.

Each of the architectures was implemented using an LLR word size of 4.0. The performance comparison is shown in Figure 4.2.
The performance reference is a MATLAB simulation using the SP algorithm and float-
ing point LLRs. The fully parallel decoder operates with an approximately 0.4 dB loss
compared to the reference, and the partially parallel decoder operates with approximately
0.7 dB loss. The extra loss from the partially parallel decoder is likely due to its non-optimal
schedule.
Fig. 4.2: Comparison of BERs for LDPC Decoder Architectures (4.0 word size)
The resource utilization and maximum throughput for the three implementations are
shown in Table 4.3. The throughput is measured as the maximum number of coded bits
that can be decoded per second, assuming an average of 25 iterations per codeword.
4.2.4 Discussion
All three architectures performed well, and each has its own strengths and weaknesses.
As far as throughput per resource used is concerned, the fully parallel is the clear winner.
The fully parallel decoder requires only 2 clock cycles per iteration. The partially
parallel decoder needs 8, and the fully serial needs 2n − k cycles per iteration.
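The relative speeds follow directly from these cycle counts; a quick check, assuming one update per cycle:

```python
# Cycles per decoding iteration for the three architectures on the
# (n = 128, k = 64) code, from the figures given in the text.
n, k = 128, 64
cycles_per_iteration = {
    "fully parallel": 2,            # one CNU cycle + one VNU cycle
    "partially parallel": n // 16,  # 16 VNUs / 8 CNUs working blockwise
    "fully serial": 2 * n - k,      # n VNU updates + (n - k) CNU updates
}
```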
The fully serial architecture was chosen for use on Iris for several reasons. First, it uses
very few resources compared to the other architectures. In total it uses approximately 1% of
the FPGA resources. Second, although it has much lower throughput than the others, Iris
is currently limited to a 1 Mbps uplink symbol rate, so it performs faster than is required
for Iris. It also performs better than the partially parallel implementation.
CHAPTER 5
CONCLUSION
5.1 Discussion
All of the objectives of this project were met. The design spaces of LDPC encoders and
decoders were explored, especially relating to the CCSDS-defined codes. An LDPC encoder
and several LDPC decoders were implemented with various trade-offs. The trade-offs were
considered and specific implementations were chosen for inclusion on the Iris radio.
The original goal of supporting all 10 LDPC telemetry codes was largely met, with an encoder
supporting 7 of the 10 codes. The k = 16384 codes were deemed too costly to implement,
given that they only perform a small fraction of a dB better than the k = 4096 codes.
Adding support for the k = 16384 codes would require more than 4x the resources of the current encoder.
The encoder and decoder both meet the requirements for integration into the Iris radio
platform. However, further work could be done to improve performance, increase speed, or reduce resource usage.
First, the decoder remains to be integrated into Iris. The encoder has been fully
integrated and extensively tested with promising results. The decoder will be integrated at
a future date.
The encoder state machine is not fully optimized and consequently the maximum
throughput is limited. Decreasing the number of cycles required between input bits would
yield a fairly significant increase in maximum speed, especially for the codes already requiring only a single cycle per input bit.
while the parity bits of the previous message are being transmitted. The speed under this scheme would no longer be limited by the gap between consecutive messages.

The AR4JA structure of the telemetry codes was not taken advantage of for the encoder.
Further work may find a way to exploit this structure to create a very fast, low-resource
encoder. However, such an encoder would not be compatible with the rate 7/8 C2 code, which does not share the AR4JA structure.
schedules exist, such as the serial schedule [31] and layered decoding [32–34], which can
purportedly converge up to 50% more quickly, meaning that less logic is required to achieve
the same throughput and error performance. Alternatively, the CNU and VNU operations
of the flooding schedule could be made to overlap in the regions without memory access conflicts.
in the literature that take advantage of the fact that the min-sum check node output mes-
sages have one of two magnitudes [36, 37]. Only those two magnitudes can be stored, along
with the signs, then retrieved later. This scheme could reduce memory usage significantly.
The only CNU algorithm implemented was the min-sum algorithm. Many alternative
CNU algorithms exist, such as min-sum offset and scale corrections [39,40], the product-sum
algorithm [21], and more. Increases in CNU complexity have a significant negative effect
on utilization for the parallelized architectures, but for the serial architecture the increase
is only incurred once. Performance could improve by at most 0.4 dB with a different CNU
algorithm.
REFERENCES
[1] TC Synchronization and Channel Coding, ser. Blue Book, No. 3, Consultative
Committee for Space Data Systems (CCSDS) Recommendation for Space Data
System Standard 231.0-B-3, Sep. 2017. [Online]. Available: https://public.ccsds.org/
Pubs/231x0b3.pdf
[4] TM Synchronization and Channel Coding, ser. Blue Book, No. 3, Consultative
Committee for Space Data Systems (CCSDS) Recommendation for Space Data
System Standard 131.0-B-3, Sep. 2017. [Online]. Available: https://public.ccsds.org/
Pubs/131x0b3e1.pdf
[5] T. K. Moon, Error Correction Coding: Mathematical Methods and Algorithms. New
York, NY, USA: Wiley-Interscience, 2005.
[7] Y. Wang, D. Liu, L. Sun, and S. Wu, “Real-time implementation for reduced-
complexity LDPC decoder in satellite communication,” China Communications,
vol. 11, no. 12, pp. 94–104, Dec. 2014.
[10] D. MacKay and R. Neal, “Near Shannon limit performance of low density parity check
codes,” Electronics Letters, vol. 33, no. 6, pp. 457–458, Mar. 1997.
[12] Y. He, J. Yang, and J. Song, “A survey of error floor of LDPC codes,” in 2011 6th
International ICST Conference on Communications and Networking in China (CHI-
NACOM), Aug. 2011, pp. 61–64.
[15] Next Generation Uplink, ser. Green Book, No. 1, Consultative Committee for
Space Data Systems (CCSDS) Informational Report 230.2-G-1, Jul. 2014. [Online].
Available: https://public.ccsds.org/Pubs/230x2g1.pdf
[17] D. Divsalar, S. Dolinar, and C. Jones, “Low-rate LDPC codes with simple protograph
structure,” in Proceedings. International Symposium on Information Theory, 2005.
ISIT 2005., Sep. 2005, pp. 1622–1626, ISSN: 2157-8117.
[18] D. Theodoropoulos, N. Kranitis, and A. Paschalis, “An efficient LDPC encoder ar-
chitecture for space applications,” in 2016 IEEE 22nd International Symposium on
On-Line Testing and Robust System Design (IOLTS), Jul. 2016, pp. 149–154, ISSN:
1942-9401.
[19] K. Andrews, S. Dolinar, and J. Thorpe, “Encoders for block-circulant LDPC codes,” in
Proceedings. International Symposium on Information Theory, 2005. ISIT 2005., Sep.
2005, pp. 2300–2304, ISSN: 2157-8117.
[21] X.-Y. Hu, E. Eleftheriou, D.-M. Arnold, and A. Dholakia, “Efficient implementations
of the sum-product algorithm for decoding LDPC codes,” in GLOBECOM’01. IEEE
Global Telecommunications Conference (Cat. No.01CH37270), vol. 2, Nov. 2001, pp.
1036–1036E vol. 2.
[27] C.-L. Wey, M.-D. Shieh, and S.-Y. Lin, “Algorithms of Finding the First Two Minimum
Values and Their Hardware Implementation,” IEEE Trans. Circuits Syst. I, vol. 55,
no. 11, pp. 3430–3437, Dec. 2008.
[32] D. Hocevar, “A reduced complexity decoder architecture via layered decoding of LDPC
codes,” in IEEE Workshop on Signal Processing Systems, 2004. SIPS 2004., Oct. 2004,
pp. 107–112.
[35] J.-H. Yoon and J. Park, “An Efficient Memory-Address Remapping Technique for
High-Throughput QC-LDPC Decoder,” Circuits, Systems, and Signal Processing,
vol. 33, no. 11, pp. 3457–3473, Nov. 2014. [Online]. Available: https:
//doi.org/10.1007/s00034-014-9808-3
[37] A. Prabhakar and K. Narayanan, “A memory efficient serial LDPC decoder architec-
ture,” in Proceedings. (ICASSP ’05). IEEE International Conference on Acoustics,
Speech, and Signal Processing, 2005., vol. 5, Mar. 2005, pp. v/41–v/44 Vol. 5, ISSN:
2379-190X.
[38] X. Chen, J. Kang, S. Lin, and V. Akella, “Memory System Optimization for FPGA-
Based Implementation of Quasi-Cyclic LDPC Codes Decoders,” IEEE Trans. Circuits
Syst. I, vol. 58, no. 1, pp. 98–111, Jan. 2011.
[39] M. Xu, J. Wu, and M. Zhang, “A modified Offset Min-Sum decoding algorithm for
LDPC codes,” in 2010 3rd International Conference on Computer Science and Infor-
mation Technology, vol. 3, Jul. 2010, pp. 19–22.