
IEEE TRANSACTIONS ON COMMUNICATIONS, VOL. 51, NO. 2, FEBRUARY 2003

Transactions Papers

VLSI Architectures for the MAP Algorithm


Emmanuel Boutillon, Warren J. Gross, Student Member, IEEE, and P. Glenn Gulak, Senior Member, IEEE

Abstract—This paper presents several techniques for the very large-scale integration (VLSI) implementation of the maximum a posteriori (MAP) algorithm. In general, knowledge about the implementation of the Viterbi algorithm can be applied to the MAP algorithm. Bounds are derived for the dynamic range of the state metrics which enable the designer to optimize the word length. The computational kernel of the algorithm is the Add-MAX* operation, which is the Add-Compare-Select operation of the Viterbi algorithm with an added offset. We show that the critical path of the algorithm can be reduced if the Add-MAX* operation is reordered into an Offset-Add-Compare-Select operation by adjusting the location of registers. A general scheduling for the MAP algorithm is presented which gives the tradeoffs between computational complexity, latency, and memory size. Some of these architectures eliminate the need for RAM blocks with unusual form factors or can replace the RAM with registers. These architectures are suited to VLSI implementation of turbo decoders.

Index Terms—Forward–backward algorithm, MAP estimation, turbo codes, very large-scale integration (VLSI), Viterbi decoding.

Paper approved by R. D. Wesel, the Editor for Coding and Communication Theory of the IEEE Communications Society. Manuscript received September 1, 1999; revised July 13, 2001 and July 2, 2002. This paper was presented in part at the 5'eme Workshop AAA sur l'Adequation Algorithme Architecture, INRIA Rocquencourt, France, January 26–28, 2000. E. Boutillon is with L.E.S.T.E.R, Université de Bretagne Sud, 56325 Lorient Cedex, France (e-mail: [email protected]). W. J. Gross and P. G. Gulak are with the Department of Electrical and Computer Engineering, University of Toronto, Toronto, ON M5S 3G4, Canada (e-mail: [email protected]; [email protected]). Digital Object Identifier 10.1109/TCOMM.2003.809247

I. INTRODUCTION

IN RECENT YEARS, there has been considerable interest in soft-output decoding algorithms: algorithms that provide a measure of reliability for each bit that they decode. The most promising applications of soft-output decoding algorithms are probably turbo codes and related concatenated coding techniques [1]. Decoders for these codes consist of several concatenated soft-output decoders, each of which decodes part of the overall code and then passes "soft" reliability information to the other decoders. The component soft-output algorithm prescribed in the original turbo code paper [1] is usually known as the maximum a posteriori (MAP), forward–backward (FB), or Bahl–Cocke–Jelinek–Raviv (BCJR) algorithm [2], [3]. This algorithm, originally described in the late 1960s, was generally overlooked in favor of the less complex Viterbi algorithm [4], [5]; moreover, applications taking advantage of soft-output information were not evident. In this paper, we describe techniques for implementing the MAP algorithm that are suitable for very large-scale integration (VLSI) implementation.

The main idea in this paper can be summarized as extending well-known techniques used in implementing the Viterbi algorithm to the MAP algorithm. The MAP algorithm can be thought of as two Viterbi-like algorithms running in opposite directions over the data, albeit with a slightly different computational kernel.

This paper is structured in the following way. Section II is a brief description of the MAP algorithm in the logarithmic domain. Section III studies the problem of internal representation of the state metrics for a fixed-point implementation. Section IV focuses on efficient architectures to realize a forward (or backward) recursion. The log-likelihood ratio (LLR) calculation is also briefly described. Section V proposes several schedules for the forward and backward recursions. As the computations of the forward and backward recursions are symmetrical in time (i.e., identical in terms of hardware computation), only the forward recursion is described in Sections III and IV.

II. MAP ALGORITHM

A. Description of the Algorithm

The MAP algorithm is derived in [3] and [6], to which the reader is referred for a detailed description. The original derivation of the MAP algorithm was in the probability domain. The output of the algorithm is a sequence of decoded bits along with their reliabilities. This "soft" reliability information is generally described by the a posteriori probability (APP). For an estimate of the bit d_k (+1/−1) having received the symbol sequence y, we define the optimum soft output as

    Λ_k = ln [ P(d_k = +1 | y) / P(d_k = −1 | y) ]    (1)

which is called the log-likelihood ratio (LLR). The LLR is a convenient measure, since it encapsulates both soft and hard bit information in one number. The sign of the number corresponds to the hard decision while the magnitude gives a reliability estimate. The original formulation of the MAP algorithm requires multiplication and exponentiation to calculate the required probabilities.

In this paper, we consider the MAP algorithm in the logarithmic domain as described in [7]. The MAP algorithm, in its native form, is challenging to implement because of the exponentiation and multiplication.

If the algorithm is implemented in the logarithmic domain like the Viterbi algorithm, then the multiplications become additions and the exponentials disappear. Addition is transformed according to the rule described in [8]. Following [9], the additions are replaced using the Jacobi logarithm

    ln(e^a + e^b) = max(a, b) + ln(1 + e^(−|a−b|)) = max*(a, b)    (2)

which is called the max* operation, to denote that it is essentially a maximum operator adjusted by a correction factor. The second term, a function of the single variable |a − b|, can be precalculated and stored in a small lookup table (LUT) [9]. The computational kernel of the MAP algorithm is the Add–max* operation, which is analogous, in terms of computation, to the Add–Compare–Select (ACS) operation in the Viterbi algorithm adjusted by an offset known as a correction factor. In what follows, we will refer to this kernel as ACSO (Add–Compare–Select–Offset).

The algorithm is based on the same trellis as the Viterbi algorithm. The algorithm is performed on a block of received symbols which corresponds to a trellis with a finite number of stages N. We will choose the transmitted bit d_k from the set {−1, +1}. Upon receiving the symbol y_k from the additive white Gaussian noise (AWGN) channel with noise variance σ², we calculate the branch metrics of the transition from state s′ to state s as

    γ_k(s′, s) = (2/σ²) y_k · x_k(s′, s)    (3)

where x_k(s′, s) is the expected symbol along the branch from state s′ to state s. The multiplication by 2/σ² can be done with either a multiplier or an LUT. Note that in the case of a turbo decoder which uses several iterations of the MAP algorithm, the multiplication by 2/σ² need only be done at the input to the first MAP algorithm [6].

The algorithm consists of three steps.

• Forward Recursion. The forward state metrics α_k(s) are recursively calculated and stored as

    α_k(s) = max*_{s′} [ α_{k−1}(s′) + γ_k(s′, s) ]    (4)

The recursion is initialized by forcing the starting state to state 0 and setting

    α_0(0) = 0 and α_0(s) = −∞ for s ≠ 0    (5)

• Backward Recursion. The backward state metrics β_k(s) are recursively calculated and stored as

    β_k(s) = max*_{s′} [ β_{k+1}(s′) + γ_{k+1}(s, s′) ]    (6)

The recursion is initialized by forcing the ending state to state 0 and setting

    β_N(0) = 0 and β_N(s) = −∞ for s ≠ 0    (7)

The trellis termination condition requires the entire block to be received before the backward recursion can begin.

• Soft-Output Calculation. The soft output, which is called the LLR, for each symbol at time k is calculated as

    Λ_k = max*_{(s′,s): d_k = +1} [ α_{k−1}(s′) + γ_k(s′, s) + β_k(s) ] − max*_{(s′,s): d_k = −1} [ α_{k−1}(s′) + γ_k(s′, s) + β_k(s) ]    (8)

where the first term is over all branches with input label +1, and the second term is over all branches with input label −1.

The MAP algorithm, as described, requires the entire message to be stored before decoding can start. If the blocks of data are large, or the received stream is continuous, this restriction can be too stringent; "on-the-fly" decoding using a sliding-window technique has to be used. Similar to the Viterbi algorithm, we can start the backward recursion from the "all-zero vector" (i.e., all the components of the state metric vector are equal to zero) with data {y_k}, from k + L down to k. L iterations of the backward recursion allow us to reach a very good approximation of β_k + c (where c is a positive additive factor) [10], [11]. This additive coefficient does not affect the value of the LLR. In the following, we will consider that after L cycles of backward recursion, the resulting state metric vector is the correct one. This property can be used in a hardware realization to start the effective decoding of the bits before the end of the message. The parameter L is called the convergence length. For on-the-fly decoding of nonsystematic convolutional codes as discussed in [10] and [11], an L of five to ten times the constraint length was found to lead only to marginal signal-to-noise ratio (SNR) losses. For turbo decoders, due to the iterative structure of the computation, an increased value of L might be required to avoid an error floor. A larger value of L is reported in [12] for a recursive systematic code with a constraint length of five. In practice, the final value of L has to be determined via system simulation and analysis of the particular decoding system at hand.

B. Upper Bounds for the max* Operation

All the following upper bounds are derived from the definition of max* in (2):

    max*(a, b) ≤ max(a, b) + ln 2    (9)

For practical implementation, one can notice that, due to the finite precision of the hardware implementation, the function ln(1 + e^(−x)) gives a zero result as soon as x = |a − b| is large enough. For example, if the values are coded in fixed precision with three binary places (a quantum of 0.125), then ln(1 + e^(−2.5)) ≈ 0.08, thus it will be rounded to 0. In that case, the computation of the offset of the max* operator can be performed with two pieces of information: a Boolean Z (for zero) that indicates if |a − b| is above or equal to the first power of two greater than 2.5, i.e., four. If Z is true, then the offset is equal to 0. If not, its exact value is computed with the five least significant bits of |a − b|. The maximum value of the offset is ln 2 ≈ 0.69, which will be quantized to 0.75, i.e., the width of the LUT is three bits for our example.
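Before moving on, it is useful to see (2)–(8) assembled into a working whole. The following Python sketch is a floating-point reference model, not a hardware description: the helper names (max_star, log_map) are ours, the correction term of (2) is computed exactly instead of being read from an LUT, and the branch metrics (3) are assumed precomputed.

    import math

    def max_star(a, b):
        # Jacobi logarithm (2): max plus a correction term. A hardware
        # implementation would read ln(1 + e^{-|a-b|}) from a small LUT.
        return max(a, b) + math.log1p(math.exp(-abs(a - b)))

    NEG_INF = float("-inf")

    def acc(current, candidate):
        # Fold one more operand into a running max* reduction.
        return candidate if current == NEG_INF else max_star(current, candidate)

    def log_map(branches, n_states, n_stages, gamma):
        """branches: trellis edges (s_prev, s_next, bit), bit in {-1, +1}.
        gamma[k][(s_prev, s_next)]: branch metric of stage k = 0..n_stages-1.
        Returns the LLR (1) of every bit."""
        # Forward recursion (4), initialized per (5).
        alpha = [[NEG_INF] * n_states for _ in range(n_stages + 1)]
        alpha[0][0] = 0.0
        for k in range(n_stages):
            for sp, sn, _ in branches:
                alpha[k + 1][sn] = acc(alpha[k + 1][sn],
                                       alpha[k][sp] + gamma[k][(sp, sn)])
        # Backward recursion (6), initialized per (7).
        beta = [[NEG_INF] * n_states for _ in range(n_stages + 1)]
        beta[n_stages][0] = 0.0
        for k in range(n_stages - 1, -1, -1):
            for sp, sn, _ in branches:
                beta[k][sp] = acc(beta[k][sp],
                                  beta[k + 1][sn] + gamma[k][(sp, sn)])
        # Soft-output calculation (8); beta[k+1] here plays the role of the
        # paper's beta_k attached to the target state of the stage-k branch.
        llrs = []
        for k in range(n_stages):
            num = den = NEG_INF
            for sp, sn, bit in branches:
                m = alpha[k][sp] + gamma[k][(sp, sn)] + beta[k + 1][sn]
                if bit > 0:
                    num = acc(num, m)
                else:
                    den = acc(den, m)
            llrs.append(num - den)
        return llrs

For a sliding-window decoder, the same routine applies over a window, with the backward recursion started from the all-zero vector instead of (7).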
An LUT is the most straightforward way to perform this operation [9], [15]. In the general case, there is a positive value x₀ such that

    max*(a, b) = max(a, b) if |a − b| ≥ x₀    (10)

We also have

    max*(a, b) ≥ max(a, b)    (11)

with equality, in finite precision, if |a − b| ≥ x₀ (note that quantization of the LUT is also discussed in [14]).

III. PRECISION OF STATE METRICS

The precision of the state metrics is an important issue for VLSI implementation. The number of bits used to code the state metrics determines both the hardware complexity and the speed of the hardware. This motivates the need for techniques which minimize the number of bits required without modifying the behavior of the algorithm.

The same problem has been intensively studied for the Viterbi algorithm ([16], [17]) and solutions using rescaling or modulo arithmetic are widely used [18], [19]. These techniques are based on the fact that, at every instant k, the dynamic range of the state metrics (i.e., the difference between the state metrics with the highest and lowest values) is bounded.

The forward recursion in the MAP algorithm is slightly different than the Viterbi algorithm since:
1) the outputs of the recursion are the state metrics themselves and not the decisions of the ACS;
2) the Add–max* is an ACS operation with an added offset (ACSO).

These differences lead us to question whether the well-known implementation techniques for the Viterbi algorithm are also applicable to the MAP algorithm. The first part of this section shows that the LLR result is independent of a global shift of all of the state metrics of the forward and backward recursions. The bounds on the dynamic range of the state metric are then given.

A. Rescaling the State Metrics

Let us first show that the max* operator is shift invariant, that is, it still produces a valid result if both of its arguments have a common constant added to them. Let a, b, and c be real numbers. From the definition of max*, it follows that:

    max*(a + c, b + c) = max(a + c, b + c) + ln(1 + e^(−|a−b|)) = max*(a, b) + c    (12)

Thus, for any set of operands {x_i}

    max*_i (x_i + c) = max*_i (x_i) + c    (13)

According to (13), the max* operator is linear with respect to a common shift. Thus, a global shift of c for all α values (or all β values) would not change the value of Λ_k, since the contribution of c, when put outside the two max* operators of (8), is cancelled. Thus, it is the differences between the state metrics and not their absolute values that are important. Rescaling of the state metrics can be performed.

B. Approximate Bound on the Dynamic Range of the State Metrics

Let us define ν as the minimum number such that, for all k, there is a path through the trellis between every state at time k and every state at time k + ν. Let us define γmax as the maximum absolute value of the branch metric. Then, for all k, a rough bound on the dynamic range of the state metrics is

    MAX_k − MIN_k ≤ ν (2 γmax + ln 2)    (14)

Proof: Let MAX_k and MIN_k be, respectively, the maximum and minimum value of the state metric at time k. Then, according to the definition of ν, there is a path of length ν in the trellis between the state with value MAX_{k−ν} at time k − ν and every state at time k. Since at every step, the maximum state metric (with, eventually, a positive correction factor) is taken, in the worst case, along the path between k − ν and k, the state metric can decrease by γmax at each step. Thus, MIN_k is at least equal to or greater than MAX_{k−ν} − ν γmax. Similarly, the maximum increase at each stage of the state metric is γmax + ln 2 (ln 2 is the maximum value of the correction factor added at each stage). Thus, MAX_k is lower than, or equal to, MAX_{k−ν} + ν (γmax + ln 2). Grouping the upper bound of MAX_k and the lower bound of MIN_k leads to (14).

Note that in the case of a trellis corresponding to a shift register of length m, ν is equal to m.

C. Finer Bound on the Dynamic Range of the State Metrics for a Convolutional Decoder

A more precise bound can be obtained in the case of a convolutional decoder using the intrinsic properties of the encoder. All the following developments are based on a previous work on the Viterbi algorithm [17]. Note that this problem has already been independently addressed by Montorsi et al. [20], where they extend through intensive simulation, without any formal proof, the result obtained in [21] for the case of the Viterbi algorithm.

1) Exact Bound on Δ: The lower bound of Δ can be obtained using the Perron–Frobenius theorem [25]. Let us work in the probability domain and let us assume that the branch probabilities

    p_k(s′, s)    (15)

are normalized so that

    Σ_s p_k(s′, s) = 1    (16)

In a real system, the received values y are bounded (by the analog–digital conversion) and the standard deviation of the noise is a nonzero value, thus, according to (15) and (16), we have the relation

    0 < p_k(s′, s) < 1 for all k, s′, s    (17)
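Before developing the exact bound, note that the rescaling property of Section III-A is easy to check numerically. The following few lines (our own illustration, in floating point) verify that a common offset passes through a max* reduction unchanged, as stated by (13):

    import math
    from functools import reduce

    def max_star(a, b):
        return max(a, b) + math.log1p(math.exp(-abs(a - b)))

    xs, c = [0.3, -1.2, 2.5, 0.9], 7.0
    lhs = reduce(max_star, [x + c for x in xs])   # max*_i (x_i + c)
    rhs = reduce(max_star, xs) + c                # max*_i (x_i) + c
    assert abs(lhs - rhs) < 1e-12                 # (13): the shift factors out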
Let us first assume that the all-zero path is sent in the channel and that all the received symbols have the highest possible reliability. The forward recursion is performed on the received symbols. Let us study, in this case, the ratio of state probabilities between the state with the highest probability and the state with the lowest probability when the forward recursion is performed. Note that this ratio, in the log domain, is associated with the maximum difference between state metrics, i.e., the dynamic range of the state metric.

The initial state vector A_0 is the uniformly distributed vector of length N, where N is the number of states of the trellis.

Since by hypothesis all the branch metrics are independent of time, we can express the forward recursion in an algebraic form using a transition matrix T

    A_{t+1} = T A_t    (18)

By recursion, we have

    A_t = T^t A_0    (19)

By construction, T is a positive irreducible matrix (the coefficients are positive, and T only performs a modification of the probability distribution of the state metric vector). Thus, according to the Perron–Frobenius theorem [25], T can be expressed, in the basis of its eigenvectors {V_1, …, V_N}, by a diagonal matrix with the two properties:
1) V_1, the Perron eigenvector of T, is the only eigenvector of T that has all of its components positive;
2) μ_1, the Perron eigenvalue associated with V_1, is positive and μ_1 ≥ |μ_i| for all i ≠ 1.

Since, in the trellis, all the states at time k are connected to all the states at time k + ν, we deduce that all the coefficients of T^ν are strictly positive. Using the Perron–Frobenius theorem for T^ν gives an extra property: the Perron eigenvalue of T^ν is strictly greater than its other eigenvalues. From this property, we deduce that this property is also true for T, i.e., μ_1 > |μ_i| for all i ≠ 1.

Let A_0 = Σ_i λ_i V_i be the decomposition of A_0 in the basis {V_1, …, V_N}. The vector A_t can be expressed as

    A_t = T^t A_0 = Σ_i λ_i μ_i^t V_i    (20)

with |μ_i| < μ_1, for i ≠ 1.

Let us call MAX_t (respectively, MIN_t) the maximum (minimum) coordinate value of vector A_t, and R_t the ratio MAX_t / MIN_t.

Conjecture: For all t, R_t ≤ R_∞.

Proof: First, R_0 = 1, since A_0 is uniform. Second, using (20), we have

    A_t / μ_1^t = λ_1 V_1 + Σ_{i≠1} λ_i (μ_i/μ_1)^t V_i → λ_1 V_1 as t → ∞    (21)

and thus

    R_t → R_∞ = max_i V_1(i) / min_i V_1(i)    (22)

Finally, we justify the monotonic increase of R_t (which achieves the proof) by an intuitive argument. R_t is the likelihood ratio between the state that has the highest probability (state 0, by construction) and the state with the lowest probability. Since every new incoming branch metric confirms state 0, R_t is an increasing function of t.

Using the same type of argument, if one, or more, of the first received signals do not have the highest reliability, the resulting ratio will be smaller than R_∞.

Since the code is linear, the result obtained for the all-zero sequence is true for all sequences of bits. Thus, the logarithm of the ratio R_∞ gives the maximum difference of the state metric.

2) Exact Bound in Finite Precision: The exact maximum difference Δmax obtained with a fixed-precision architecture is obtained from (19) starting from the all-zero vector until the system reaches stationarity, i.e., if all state metrics increase by the same constant value at each iteration, Δmax is then equal to MAX_t − MIN_t.

Note that this algorithm is a generalization of the algorithm proposed in [22] for the case of the Viterbi decoder.

3) Simplification of the Computation of the Branch Metrics for a Convolutional Decoder: For a rate 1/n convolutional code, x_k is an n-dimensional vector with elements {−1, +1} (or {−1, 0, +1} in the case of a punctured code where 0 is used for a punctured bit). Using (13), the computation of the branch metrics can be expanded and simplified

    γ_k(s′, s) = −(1/(2σ²)) Σ_{i=1..n} (y_k^i − x_k^i)² = −(1/(2σ²)) Σ_{i=1..n} [(y_k^i)² + (x_k^i)²] + (1/σ²) Σ_{i=1..n} y_k^i x_k^i    (23)

The first terms are common to all branch metrics, thus, they can be dropped (the constant positive scale factor can be absorbed into the quantization of the inputs). The last terms can be decomposed on the n dimensions of the vector. Thus, the modified branch metrics are

    γ′_k(s′, s) = Σ_{i=1..n} y_k^i x_k^i    (24)

where x_k^i takes the value of zero for a punctured code symbol. This expression can be used to find the exact bound of Δmax.

4) Example: As an example, let us consider a recursive systematic encoder with generator polynomials (7, 5). Moreover, let us assume that the modified branch metrics γ′_k are coded using 128 levels, from −15.75 up to 15.75 (the inputs y_k^i are coded between −7.875 and 7.875, with a step size of 0.125). We assume that the all-zero path is received with the maximum reliability. The resulting state transition diagram (with values of modified branch metrics) is given in Fig. 1. Table I shows the evolution of the state metrics for the first eight iterations of the forward recursion.

As shown in Table I, the value of MAX_t − MIN_t does not increase after seven iterations. The limit value Δmax = 47.25 is the maximum value of the state metric dynamic range obtained for our example. The approximate bound of (14) gives, for this example, 2 × (2 × 15.75 + ln 2) ≈ 64. The bound obtained by the above method is much more precise and can lead to more efficient hardware realizations, since the precision of the state metrics is reduced.
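The stationarity procedure of Section III-C.2 is easy to script. The sketch below is our own reconstruction: the (7,5) encoder convention is an assumption (a different but equivalent labeling only permutes states), and we use a plain max rather than a full quantized ACSO — with the quantized offset of Section II-B the procedure is identical, but ties between metrics would contribute a 0.75 offset and can shift the limit slightly. It iterates the forward recursion on the all-zero path received at maximum reliability and stops when all state metrics grow by the same constant.

    # Trellis of the (7,5) recursive systematic encoder: state = (s1, s2).
    # For input bit d: feedback f = d ^ s1 ^ s2, parity = f ^ s2,
    # next state = (f, s1). BPSK mapping: bit 0 -> -1, bit 1 -> +1.
    Y = (-7.875, -7.875)  # all-zero path received with maximum reliability

    def branches():
        for s in range(4):
            s1, s2 = s >> 1, s & 1
            for d in (0, 1):
                f = d ^ s1 ^ s2
                parity = f ^ s2
                # Modified branch metric (24): sum of y_i * x_i, x_i = +/-1.
                gamma = Y[0] * (2 * d - 1) + Y[1] * (2 * parity - 1)
                yield s, (f << 1) | s1, gamma

    def exact_dynamic_range():
        a = [0.0] * 4                       # start from the all-zero vector
        for t in range(1, 100):
            nxt = [float("-inf")] * 4
            for sp, sn, g in branches():
                nxt[sn] = max(nxt[sn], a[sp] + g)   # ACS step, offset = 0
            growth = [n - o for n, o in zip(nxt, a)]
            a = nxt
            if max(growth) == min(growth):  # stationarity: uniform growth
                return max(a) - min(a), t
        return None

    print(exact_dynamic_range())  # -> (47.25, ...), the limit of Table I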
Fig. 1. State transition diagram of a systematic recursive encoder with polynomials (7,5) and modified branch metrics when, for all k, (y_k^1, y_k^2) = (−7.875, −7.875).

TABLE I. VARIATION OF STATE METRICS AFTER t STAGES ON THE ALL-ZERO PATH

Note that the initial state vector is important (the all-zero vector). In the case where the initial state is known (state 0, for example), using an initial state that gives the highest probability possible for state zero and the lowest probability for all the other states can lead to some transitory values greater than Δmax. The natural solution to avoid this problem is to use the obtained eigenvector (the vector (47.250, 0, 15.750, 0) in this example) as the initial state vector. For turbo-decoder applications, the method can also be used, taking into account the extrinsic information as the initial state.

IV. ARCHITECTURE FOR THE FORWARD AND BACKWARD RECURSIONS

This section is divided into two parts. The first part is a review of the architecture usually used to compute the forward state metrics [9]. The second part is an analysis of the position of the register of the recursion loop in order to increase the speed of the architecture.

Fig. 2. Architecture of an ACSO unit.

A. Computation of the Forward State Metrics: ACSO Unit

The architecture of the processing unit that computes a new value of α_k(s) is shown in Fig. 2. The structure consists of the well-known ACS unit used for the Viterbi algorithm (grey area in Fig. 2) and some extra hardware to generate the "offset" corresponding to the correction factor of (2). As said in Section II, the offset is generated directly with an LUT that contains the precalculated result of ln(1 + e^(−|a−b|)). Then, the offset is added to the result of the ACS operation to generate the final value of α_k(s). In the following, we will call this processor unit an ACSO unit.

B. Architecture for the Forward State Metric Recursion

The natural way to perform the forward state metric recursion is to place a register at the output of the ACSO unit, in order to keep the value of α_k(s) for the next iteration. This architecture is the same as the one used for the Viterbi algorithm, and all the literature on the speed-area tradeoffs for the ACS recursion can be reused for the ACSO computation. Nevertheless, there is another position for the register which reduces the critical path of the recursion loop. Fig. 3 shows two steps of a two-state trellis.

Fig. 3. Three different positions of the register in the data flow of the forward (or backward) algorithm, leading to three types of ACSO recursion architectures.
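For concreteness, here is a small behavioral model of the two-input ACSO butterfly of Fig. 2 (our own sketch: the LUT contents follow the three-bit, 0.125-quantum example of Section II-B, and the rounding rule is an assumption):

    import math

    QUANTUM = 0.125  # three binary places, as in the example of Section II-B

    def offset_lut(diff):
        # Quantized correction term ln(1 + e^{-diff}); the Boolean Z of
        # Section II-B makes it zero once diff >= 4.
        if diff >= 4:
            return 0.0
        return round(math.log1p(math.exp(-diff)) / QUANTUM) * QUANTUM

    def acso(sm0, bm0, sm1, bm1):
        # Add-Compare-Select-Offset: one new state metric per (4).
        a, b = sm0 + bm0, sm1 + bm1        # Add
        m = a if a >= b else b             # Compare-Select
        return m + offset_lut(abs(a - b))  # Offset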
Fig. 4. Architecture of an OACS unit.

Three different positions of the recursion loop register are shown. The first position is the classical one. It leads to an ACSO unit. The second position leads to a compare-select-offset-add (CSOA) unit, while the third position leads to an offset-add-compare-select (OACS) unit. The last one, the OACS unit shown in Fig. 4, has a smaller critical path compared with the ACSO unit. Briefly, in the case of an ACSO unit, the critical path is composed of the propagation of the carry (t_carry) in the first adder, the propagation of one full adder (t_FA) for the comparison (as soon as a result of the sum is available, it can be used for the comparison), the time of the LUT access (t_LUT) and the multiplexer (t_MUX), and then, once more, the time of the propagation of the carry in the offset addition. For the OACS unit, the critical path is only composed of the propagation of the carry in the first adder (the addition of the offset), the propagation of one full adder for the addition of the branch metric, another propagation of one full adder for the comparison, and then, the maximum of the LUT access and the multiplexer. Thus, the critical path is decreased from

    t_ACSO = 2 t_carry + t_FA + max(t_LUT, t_MUX)    (25)

to

    t_OACS = t_carry + 2 t_FA + max(t_LUT, t_MUX)    (26)

The decrease of the critical path is paid for by an additional register needed to store the offset value between two iterations. The area-speed tradeoff is determined by the specification of the application. As mentioned by one of the paper's reviewers, a Carry–Save–Adder (CSA) architecture can also be efficiently used in this case [23].

The last step of the MAP algorithm is the computation of the LLR value of the decoded bit. Parallel architectures for the LLR computation can be derived directly from (8). The first stage is composed of 2^(m+1) adders. The second stage is composed of two 2^m-operand max* operators. Finally, the last operation is the subtraction. A classical tree architecture can be used for the hardware realization of the 2^m-operand max* operators.

V. GENERAL ARCHITECTURE

Each element of the MAP architecture has now been described. The last part of our survey on VLSI architectures for the MAP algorithm is the overall organization of the computation. Briefly speaking, the generation of the LLR values requires both α and β values, which are generated in chronologically reverse order. The first implication is that, somehow, memory is needed to store a given type of vector (say, α), until the corresponding vector (β) is generated. Each state metric vector is composed of 2^m state metrics (the size of the trellis), each one b bits wide. The total number of bits for each vector is large (2^m · b) and thus, the reduction of the number of stored state metric vectors is an important issue for minimizing the implementation area.

The first part of this section describes the architecture of a high-speed VLSI circuit for the forward algorithm. Then, through different steps, we propose several organizations of the computation that reduce the number of vectors that need to be stored by up to a factor of eight. Note that several authors have separately achieved similar results. This point will be discussed in the last section.

A. Classical Solutions: The (n_α = 1, n_β = 2, M_α) and (n_α = 1, n_β = 2, M_β) Architectures

The first real-time MAP VLSI architectures in the literature are described in [11], [13], and [24]. The architecture of [11] and [13] is based on three recursion units (RUs), two used for the backward recursion (RU_β1 and RU_β2), and one forward unit (RU_α). Each RU contains 2^m operators working in parallel so that one recursion can be performed in one clock cycle. The two backward RUs play a role similar to the two trace-back units in the Viterbi decoder of [26].

Let us use the same graphical representation as in [11], [27], and [28] to explain the organization of the computation. In Fig. 5, the horizontal axis represents time, with units of a symbol period. The vertical axis represents the received symbol index. Thus, the acquisition curve shows that, at time t_k, the symbol y_k becomes available. Let us describe how the symbols are decoded (segment I of Fig. 5).

First, RU_β1 performs L recursions, starting from the most recently received symbol and working back over the last block of L symbols (segment II of Fig. 5). This process is initialized with the all-zero state vector, but after L iterations, as noted in [11], convergence is reached and a valid β vector is then obtained. During those same L cycles, RU_α generates the α vectors (segment III of Fig. 5). The α vectors are stored in the state vector memory (SVM) until they are needed for the LLR computation (grey area of Fig. 5). Then, during the next L cycles, RU_β2 starts from the state vector delivered by RU_β1 to compute the effective β vectors of the block (segment IV of Fig. 5). At each cycle, the α vector corresponding to the computed β vector is extracted from the memory in order to compute the LLR.
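Each LLR evaluation in segment IV applies the two 2^m-operand max* reductions described in Section IV; as noted there, these map onto balanced trees of two-input operators, giving a depth of m levels. A sketch of such a tree reduction (our own illustration, assuming a power-of-two operand count):

    import math

    def max_star(a, b):
        return max(a, b) + math.log1p(math.exp(-abs(a - b)))

    def max_star_tree(values):
        # Balanced-tree reduction: log2(len(values)) levels of 2-input max*,
        # matching the tree architecture of the LLR stage.
        while len(values) > 1:
            values = [max_star(values[i], values[i + 1])
                      for i in range(0, len(values), 2)]
        return values[0]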
Fig. 5. Graphical representation of a real-time MAP architecture.
Fig. 6. Graphical representation of the (n_α = 1, n_β = 2, M_αβ) architecture.

Finally, during the next L cycles, the data are reordered (segment V of Fig. 5) using a memory for reversing the LLR values (light grey area of Fig. 5). The same process is then reiterated every L cycles, as shown in Fig. 5.

In the case where the MAP unit is being used in a turbo decoder, the reordering can be done implicitly by the interleaver. Moreover, the a priori information to be subtracted [1] can be reversed in time in order to be directly subtracted after generation of the LLR value (segment IV of Fig. 5). Note that the role of the memories is to reverse the order of the state vectors. Reordering of the state metrics can be done with a single RAM and an up/down counter during L clock cycles. The incoming data are stored at addresses 0 up to L − 1. In the next L cycles, the counter counts down and the state vectors are retrieved from the RAM and, at the same time, the new incoming state vectors are stored in the same RAM block (from addresses L − 1 down to 0). Only one read/write access is done at the same location every clock cycle. This avoids the need for multiport memories.

This graphical representation gives some useful information about the architecture. For example, the values of:
1) the decoding latency (the horizontal distance between the "acquisition" and "decoded bit" curves);
2) the number of vectors to be stored (the maximum vertical size of the grey area);
3) the "computational cost" of the architecture, i.e., the total number of forward and backward iterations performed for each received data item (the number of RU arrows cut by a vertical line).

Note that to perform the recursions, branch metrics have to be available. This can easily be done using three RAMs of size L that contain the branch metrics of the three last received blocks of size L. Note that the RAM can simply store the received symbols. In that case, branch metrics are computed on the fly every time they are needed. Since the amount of RAM needed to store branch metric information is small compared with the amount of RAM needed to store the state metrics, evaluation of branch metric computation will be omitted in the rest of the paper.

In what follows, this architecture is referred to as (n_α, n_β, M), where n_α and n_β are, respectively, the number of RUs used for the forward and backward recursions. M_α (for the memory of state metric α) indicates that the α vectors are stored for the LLR bit computation. Note that in this architecture, the forward recursion is performed 2L cycles after the initial reception of data.

With the (n_α = 1, n_β = 2, M_α) architecture, L state vectors have to be stored. The length of convergence L is relatively small (a few times the constraint length) but the size of the state vector is very large. In fact, a state vector is composed of 2^m state metrics, and each state metric is b bits wide, i.e., 2^m · b bits per state metric vector. The resulting memory is very narrow, and thus, not well suited for a realization with a single RAM block, but it can be easily implemented by connecting several small RAM blocks in parallel.

The architecture (n_α = 1, n_β = 2, M_β) is reported in [10]. It is equivalent to the former one, except that the forward recursion is performed 4L cycles after the reception of the data, instead of 2L cycles (segment V of Fig. 5 instead of segment III). In this scheme, the β vectors generated by RU_β2 are stored until the computation of the corresponding α vectors by RU_α (light grey area of Fig. 5). Then, the LLR values are computed in the natural order.

Other architectures have been developed. Each presents different tradeoffs between computational power, memory size, and memory bandwidth. Their graphical representations are given below.

B. The (n_α = 1, n_β = 2, M_αβ) Architecture

In this architecture, the forward recursion is performed 3L cycles after the reception of the data (see Fig. 6). Thus, both α vectors and β vectors have to be stored. The total number of state vectors to be stored is still L. Moreover, with this solution, two bits have to be decoded in each of the last L clock cycles of an iteration, thus, two APP units have to be used. This scheme becomes valuable when two independent MAP decoders work in parallel. Since two MAP algorithms are performed in parallel, it is possible to share the memory words between the two MAP algorithms by an appropriate interleaving of the two operations, as shown in Fig. 6.
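The single-RAM reversal trick described above works because the read of an old vector and the write of a new one always target the same address. A minimal model (our own sketch) of the address sequence and buffer behavior:

    def reorder_stream(vectors, L):
        # Reverse successive blocks of L state vectors with a single RAM:
        # each cycle does one read and one write at the same address.
        ram = [None] * L
        out = []
        up = True
        for k, v in enumerate(vectors):
            addr = (k % L) if up else (L - 1 - (k % L))
            out.append(ram[addr])   # read the vector stored one block earlier
            ram[addr] = v           # then overwrite it with the incoming one
            if (k + 1) % L == 0:
                up = not up         # flip counting direction every L cycles
        return out                  # each block comes out reversed, delayed by L

    print(reorder_stream(list(range(12)), L=4))
    # [None, None, None, None, 3, 2, 1, 0, 7, 6, 5, 4]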
Fig. 7. Graphical representation of the (n_α = 1, n_β = 3, M_β) architecture.
Fig. 8. Graphical representation of the (n_α = 2, n_β = 2, M_α) architecture.

In this figure, the second iteration is represented with dotted lines and the corresponding vector memory with a striped region. This scheme can be used in a pipeline of decoders to simultaneously decode two information streams. With this interleaving, the amount of memory for state metrics corresponding to each MAP is divided by two. Thus, the final area will be smaller than the simple juxtaposition of two "classical" MAP algorithms. With this solution, two read and two write accesses are needed at each symbol cycle. Those accesses can be shared harmoniously with two RAMs, with a read and a write access at the same address for each of the two RAMs.

The MAP architecture can use more than two RUs for the backward recursion and/or more than one RU for the forward recursion. The following sections describe some interesting solutions.

C. The (n_α = 1, n_β = 3, M_β) Architecture

An additional backward unit leads to the schedule of Fig. 7. A new backward recursion is started every L/2 cycles on a length of L symbols. The first L/2 steps are used to achieve convergence, then the last L/2 steps generate valid β vectors. The new latency is now 3L, and the amount of memory needed to store the vectors is only L/2. Two observations are worth noting:
1) the reduction of the latency and the memory size is paid for by a new backward unit;
2) a solution of type (n_α = 1, n_β = 3, M_α) can also be used.

D. The (n_α = 2, n_β = 2, M_α) Architecture

The addition of an extra forward unit can also decrease the SVM by a factor of two, as shown in Fig. 8. This scheme has the same number of processing units (four RUs) and the same state metric memory size as the (n_α = 1, n_β = 3, M_β) architecture, but its latency is 4L compared with 3L for the architecture of the previous section. However, the second recursion unit RU_α2 can be simplified, since it only copies, with a time shift, the computation of RU_α1. Thus, there exists a tradeoff between computational complexity and memory. By storing, during this time shift, each decision and offset value generated by RU_α1, the complexity of RU_α2 is almost divided by two (see Fig. 9 and the sketch below).

Fig. 9. Simplified ACSO unit.

This method is very similar to the method used for the soft-output Viterbi algorithm [29]. Note that once more, an (n_α = 2, n_β = 2, M_β) method can be used.

E. The (n_α = 1, n_β = 3, M_β, Pt_β) Architecture

This type of architecture is a generalization of the idea described above: instead of memorizing a large number of α (or β) vectors, they are recomputed when they are needed. For this, the context (i.e., the state metrics) of an iteration process is saved in a pointer. This pointer is used later to recompute, with a delay, the series of state metrics. Such a process is given in Fig. 10.

In this scheme, the state metrics of the first backward recursion are saved every L/4 cycles (small circles in Fig. 10). Those saved state metrics (four metrics, in our four-state example) are used as a seed, or pointer, to start the third backward process (of length L/4) shown in Fig. 10. The third backward recursion is synchronized with the forward recursion in order to minimize the number of vectors to be stored. In practice, only three seeds are needed, since the second and third backward processes work on the same data during the last quarter of a segment of L cycles. With this method, the latency is still 4L, but the number of state metric vectors to store is now L/8. With such a small number of vectors, registers instead of RAM can be used to store the state metrics.
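To illustrate the simplified unit of Fig. 9, the following sketch (our own reconstruction of the record-and-replay idea; function names are hypothetical) records, for each state, the ACS decision and the offset chosen by the first forward unit, so that the replay unit needs only an addition, a two-way selection, and the offset addition — no comparator and no LUT:

    def acso_record(sms, bms, preds, offset_lut):
        """One forward step of the full unit. preds[s] lists the two
        (s_prev, branch) pairs feeding state s. Returns the new metrics
        plus the (decision, offset) trace to be stored."""
        new, trace = [], []
        for (p0, b0), (p1, b1) in preds:
            a, b = sms[p0] + bms[b0], sms[p1] + bms[b1]   # Add
            dec = int(b > a)                               # Compare
            off = offset_lut(abs(a - b))                   # LUT
            new.append((b if dec else a) + off)            # Select-Offset
            trace.append((dec, off))
        return new, trace

    def acso_replay(sms, bms, preds, trace):
        # Simplified second unit: reuse the stored decision and offset,
        # so the compare and LUT stages of Fig. 2 disappear.
        new = []
        for ((p0, b0), (p1, b1)), (dec, off) in zip(preds, trace):
            p, b = (p1, b1) if dec else (p0, b0)
            new.append(sms[p] + bms[b] + off)              # Add-Offset only
        return new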
Fig. 10. Graphical representation of the (n_α = 1, n_β = 3, M_β, Pt_β) architecture.
Fig. 11. Graphical representation of the (n_α = 1, n_β = 3, M_β, Pt_β) architecture shared between two MAP algorithms.
Fig. 12. Graphical representation of the (p = 2, n_α = 1, n_β = 1, M_β, Pt_β) architecture.

This avoids the use of a RAM with an unusual aspect ratio and a consequent negative impact on performance. This scheme becomes particularly attractive if two independent MAP algorithms are implemented in a single chip, since the architecture of Fig. 11 can then be used to share the stored vectors of the two MAP algorithms. As with the (n_α = 1, n_β = 2, M_αβ) architecture, this scheme can be used in a pipeline of decoders to simultaneously decode two information streams. This scheme is particularly efficient because it avoids the use of a RAM for storing the state metrics.

F. Generalization of the Architecture

Many combinations of the above architectures can be realized, each one with its own advantages and disadvantages. In the above examples, the ratio between the hardware clock and the symbol clock is one. Other architectures can be uncovered by loosening this restriction. For example, if this ratio is two (i.e., two clock cycles for each received symbol), the speed of the RU is doubled. Thus, an architecture such as (p = 2, n_α = 1, n_β = 1, M_β, Pt_β) can be used (see Fig. 12) to obtain a further reduced SVM.

G. Summary of the Different Configurations

In Table II, different configurations are evaluated in order to help the designer of a system. Note that p is a generalization factor and that 0.5 (in columns n_α and n_β) denotes the simplified ACSO unit of Fig. 9. We can see that in the case of two MAP algorithms implemented together in the same circuit, it is possible to decrease the number of stored vectors from L to L/8. This reduction allows the realization of this memory using only registers.

Note that the final choice of a solution among the different proposed alternatives will be made by the designer. The designer's objective is to optimize the area and/or power dissipation of the design while respecting the application requirements (decoding latency, performance). The complexity of the MAP algorithm depends on the application (continuous stream or small blocks, simple or duo-binary encoder [30], [31], number of encoder states, etc.). The consequence is that the merit of the proposed solutions can vary with the application and no general rules can be found. In practice, a fast and quite accurate complexity estimation can be obtained in terms of gate count and memory cells by simply using a field-programmable gate array synthesis tool to compile a VHDL or Verilog algorithm description.

H. Similar Works in This Area

Since the first submission of this paper, much work has been independently published on this topic. In this final subsection, we give a brief overview of these fundamental works.

The architecture of Sections V-B–D has also been proposed by Schurgers et al. In [32] and [33], the authors give a very detailed analysis of the tradeoffs between complexity, power dissipation, and throughput. Moreover, they propose a very interesting architecture of double-flow structures, where, for example, two processes of type (n_α = 1, n_β = 2, M_α) and (n_α = 1, n_β = 2, M_β) are performed in parallel on a data block of size N: the first one, in natural order, from data 0 to N/2 − 1; the second, in reverse order, from data N − 1 down to N/2.
TABLE II. PERFORMANCE OF THE DIFFERENT ARCHITECTURES

Moreover, Worm et al. [34] extend the architecture of Sections V-A and V-B to a massively parallel architecture where several processes are done in parallel. With this massive parallelism, very high throughput (up to 4 Gbit/s) can be achieved.

The pointer idea described in Section V-E has been proposed independently by Dingninou et al. in the case of a turbo decoder in [35] and [36]. In this "sliding window next iteration initialization" method, the pointer generated by the backward recursion at one iteration is used to initialize the backward recursion at the next iteration. As a result, no further backward convergence process is needed and area and memory are saved at the cost of a slight degradation of the decoder performance. Note that Dielissen et al. have improved this method by an efficient encoding of the pointer [37].

Finally, an example of an architecture using a ratio of two between clock frequency and symbol frequency (see Section V-F) is partially used in [38].

VI. CONCLUSION

We have presented a survey of techniques for VLSI implementation of the MAP algorithm. As a general conclusion, the well-known results from the Viterbi algorithm literature can be applied to the MAP algorithm. The computational kernel of the MAP algorithm is very similar to that of the ACS of the Viterbi algorithm with an added offset. The analysis shows that it is better to add the offset first and then do the ACS operation in order to reduce the critical path of the circuit (OACS). A general architecture for the MAP algorithm was developed which exposes some interesting tradeoffs for VLSI implementation. Most importantly, we have presented architectures which eliminate the need for RAMs with a narrow aspect ratio and possibly allow the RAM to be replaced with registers. An architecture which shares a memory bank between two MAP decoders enables efficient implementation of turbo decoders.

ACKNOWLEDGMENT

The authors would like to thank F. Kschischang and O. Pourquier for their help on the Perron–Frobenius theorem.

REFERENCES

[1] C. Berrou, A. Glavieux, and P. Thitimajshima, "Near Shannon limit error-correcting coding and decoding: Turbo codes," in Proc. IEEE Int. Conf. Communications (ICC'93), May 1993, pp. 1064–1070.
[2] R. W. Chang and J. C. Hancock, "On receiver structures for channels having memory," IEEE Trans. Inform. Theory, vol. IT-12, pp. 463–468, Oct. 1966.
[3] L. R. Bahl, J. Cocke, F. Jelinek, and J. Raviv, "Optimal decoding of linear codes for minimizing symbol error rate," IEEE Trans. Inform. Theory, vol. IT-20, pp. 284–287, Mar. 1974.
[4] A. J. Viterbi, "Error bounds for convolutional codes and an asymptotically optimum decoding algorithm," IEEE Trans. Inform. Theory, vol. IT-13, pp. 260–269, Apr. 1967.
[5] G. D. Forney, Jr., "The Viterbi algorithm," Proc. IEEE, vol. 61, pp. 268–278, Mar. 1973.
[6] J. Hagenauer, E. Offer, and L. Papke, "Iterative decoding of binary block and convolutional codes," IEEE Trans. Inform. Theory, vol. 42, pp. 429–445, Mar. 1996.
[7] A. J. Viterbi, "An intuitive justification and a simplified implementation of the MAP decoder for convolutional codes," IEEE J. Select. Areas Commun., vol. 16, pp. 260–264, Feb. 1998.
[8] N. G. Kingsbury and P. J. W. Rayner, "Digital filtering using logarithmic arithmetic," Electron. Lett., vol. 7, no. 2, pp. 56–58, Jan. 1971.
[9] J. A. Erfanian and S. Pasupathy, "Low-complexity parallel-structure symbol-by-symbol detection for ISI channels," in Proc. IEEE Pacific Rim Conf. Communications, Computers and Signal Processing, June 1–2, 1989, pp. 350–353.
[10] H. Dawid, Algorithms and VLSI Architecture for Soft Output Maximum a Posteriori Convolutional Decoding (in German). Aachen, Germany: Shaker, 1996, p. 72.
[11] H. Dawid and H. Meyr, "Real-time algorithms and VLSI architectures for soft output MAP convolutional decoding," in Proc. Personal, Indoor and Mobile Radio Communications, PIMRC'95, vol. 1, 1995, pp. 193–197.
[12] S. S. Pietrobon, "Efficient implementation of continuous MAP decoders and a new synchronization technique for turbo decoders," in Proc. Int. Symp. Information Theory and Its Applications, Victoria, BC, Canada, Sept. 1996, pp. 586–589.
[13] S. S. Pietrobon and S. A. Barbulescu, "A simplification of the modified Bahl algorithm for systematic convolutional codes," in Proc. Int. Symp. Information Theory and Its Applications, Sydney, Australia, Nov. 1994, pp. 1073–1077.
[14] S. S. Pietrobon, "Implementation and performance of a turbo/MAP decoder," Int. J. Satellite Commun., vol. 16, pp. 23–46, Jan.–Feb. 1998.
[15] P. Robertson, E. Villebrun, and P. Hoeher, "A comparison of optimal and sub-optimal MAP decoding algorithms operating in the log domain," in Proc. IEEE Int. Conf. Communications (ICC'95), 1995, pp. 1009–1013.
[16] C. B. Shung, P. H. Siegel, G. Ungerboeck, and H. K. Thapar, "VLSI architectures for metric normalization in the Viterbi algorithm," in Proc. IEEE Int. Conf. Communications (ICC'90), vol. 4, Atlanta, GA, Apr. 16–19, 1990, pp. 1723–1728.
[17] P. Tortelier and D. Duponteil, "Dynamique des métriques dans l'algorithme de Viterbi," Annales des Télécommun., vol. 45, no. 7–8, pp. 377–383, 1990.
[18] G. Masera, G. Piccinini, M. R. Roch, and M. Zamboni, "VLSI architectures for turbo codes," IEEE Trans. VLSI Syst., vol. 7, pp. 369–379, Sept. 1999.
[19] A. Worm, H. Michel, F. Gilbert, G. Kreiselmaier, M. Thul, and N. Wehn, "Advanced implementation issues of turbo decoders," in Proc. 2nd Int. Symp. on Turbo Codes, Brest, France, Sept. 2000, pp. 351–354.
[20] G. Montorsi and S. Benedetto, "Design of fixed-point iterative decoders for concatenated codes with interleavers," IEEE J. Select. Areas Commun., vol. 19, pp. 871–882, May 2001.
[21] A. P. Hekstra, "An alternative to metric rescaling in Viterbi decoders," IEEE Trans. Commun., vol. 37, pp. 1220–1222, Nov. 1989.
[22] P. H. Siegel, C. B. Shung, T. D. Howell, and H. K. Thapar, "Exact bounds for Viterbi detector path metric differences," in Proc. Int. Conf. Acoustics, Speech, and Signal Processing, vol. 2, 1991, pp. 1093–1096.
[23] G. Fettweis and H. Meyr, "Parallel Viterbi algorithm implementation: Breaking the ACS bottleneck," IEEE Trans. Commun., vol. 37, pp. 785–790, Aug. 1989.
[24] H. Dawid, G. Gehnen, and H. Meyr, "MAP channel decoding: Algorithm and VLSI architecture," VLSI Signal Processing VI, pp. 141–149, 1993.
[25] F. R. Gantmacher, Matrix Theory. New York: Chelsea, 1960, vol. II.
[26] 20 Mbps convolutional encoder Viterbi decoder STEL-2020: Stanford Telecom, 1989.
[27] E. Boutillon and N. Demassieux, "A generalized precompiling scheme for surviving path memory management in Viterbi decoders," in Proc. ISCAS'93, vol. 3, New Orleans, LA, May 1993, pp. 1579–1582.
[28] E. Boutillon, "Architecture et implantation VLSI de techniques de modulations codées performantes adaptées au canal de Rayleigh," Ph.D. dissertation, ENST, Paris, France, 1995.
[29] J. Hagenauer and P. Hoeher, "A Viterbi algorithm with soft-decision outputs and its applications," in Proc. IEEE Globecom Conf., Nov. 1989, pp. 1680–1686.
[30] C. Douillard, M. Jézéquel, C. Berrou, N. Bengarth, J. Tousch, and N. Pham, "The turbo code standard for DVB-RCS," in Proc. 2nd Int. Symp. on Turbo Codes, Brest, France, Sept. 2000, pp. 535–538.
[31] C. Berrou and M. Jézéquel, "Nonbinary convolutional codes for turbo coding," Electron. Lett., vol. 35, no. 1, pp. 39–40, Jan. 1999.
[32] C. Schurgers, F. Catthoor, and M. Engels, "Energy efficient data transfer and storage organization for a MAP turbo decoder module," in Proc. 1999 Int. Symp. Low Power Electronics and Design, San Diego, CA, Aug. 1999, pp. 76–81.
[33] C. Schurgers, F. Catthoor, and M. Engels, "Memory optimization of MAP turbo decoder algorithms," IEEE Trans. VLSI Syst., vol. 9, pp. 305–312, Apr. 2001.
[34] A. Worm, H. Lamm, and N. Wehn, "VLSI architectures for high-speed MAP decoders," in Proc. 14th Int. Conf. VLSI Design, 2001, pp. 446–453.
[35] A. Dingninou, "Implémentation de turbo code pour trame courtes," Ph.D. dissertation, Univ. de Bretagne Occidentale, Bretagne, France, 2001.
[36] A. Dingninou, F. Rafaoui, and C. Berrou, "Organization de la mémoire dans un turbo décodeur utilisant l'algorithme SUB-MAP," in Proc. Gretsi, Gretsi, France, Sept. 1999, pp. 71–74.
[37] J. Dielissen and J. Huisken, "State vector reduction for initialization of sliding windows MAP," in Proc. 2nd Int. Symp. Turbo Codes, Brest, France, Sept. 2000, pp. 387–390.
[38] A. Raghupathy and K. J. R. Liu, "VLSI implementation considerations for turbo decoding using a low-latency log-MAP," in Proc. IEEE Int. Conf. Consumer Electronics, ICCE, June 1999, pp. 182–183.

Emmanuel Boutillon received the engineering degree in 1990 and the Ph.D. degree in 1995, both from the Ecole Nationale Supérieure des Télécommunications (ENST), Paris, France. He joined ENST in 1992, where he conducted research in the field of VLSI for communications. In 1998, he spent a sabbatical year at the University of Toronto, Toronto, ON, Canada, where he worked on algorithms and architectures for MAP and LDPC decoding. Since 2000, he has been a Professor at the University of South Brittany, Lorient, France. His current research interests are the interactions between algorithms and architectures in the field of wireless communications.

Warren J. Gross (S'92) was born in Montreal, QC, Canada, in 1972. He received the B.A.Sc. degree in electrical engineering from the University of Waterloo, Waterloo, ON, Canada, in 1996 and the M.A.Sc. degree in 1999 from the University of Toronto, Toronto, ON, Canada, where he is currently working toward the Ph.D. degree. From 1993 to 1996, he worked in the area of space-based machine vision at Neptec Design Group, Ottawa, ON, Canada. His research interests are in the areas of VLSI architectures for digital communications algorithms and digital signal processing, coding theory, and computer architecture. Mr. Gross received the Natural Sciences and Engineering Research Council of Canada postgraduate scholarship, the Walter Sumner fellowship, and the Government of Ontario/Ricoh Canada Graduate Scholarship in Science and Technology.

P. Glenn Gulak (S'82–M'83–SM'96) received the Ph.D. degree from the University of Manitoba, Winnipeg, MB, Canada. From 1985 to 1988, he was a Research Associate with the Information Systems Laboratory and the Computer Systems Laboratory, Stanford University, Stanford, CA. Currently, he is a Professor with the Department of Electrical and Computer Engineering, University of Toronto, Toronto, ON, Canada, and holds the L. Lau Chair in Electrical and Computer Engineering. His research interests are in the areas of memory design, circuits, algorithms, and VLSI architectures for digital communications. Dr. Gulak received a Natural Sciences and Engineering Research Council of Canada Postgraduate Scholarship and several teaching awards for undergraduate courses taught in both the Department of Computer Science and the Department of Electrical and Computer Engineering, University of Toronto. He served as the Technical Program Chair for ISSCC 2001. He is a registered professional engineer in the province of Ontario.

