Turbo Tutorial
Contents
1. Introduction
2. Turbo Codes
   2.1 Encoding
   2.2 First Decoding
   2.3 Putting Turbo on the Turbo Codes
   2.4 Performance Example
3. APP Decoding
4. Final Remarks
5. Selected Literature
1. Introduction
Concatenated coding is illustrated in Figure 1. Here we see the information frame illus-
trated as a square - assuming block interleaving - and we see the parity from the vertical
encoding and the parity from the horizontal encoding. For serial concatenation the parity
bits from one of the constituent codes are encoded with the second code and we have parity
of parity. If the codes are working in parallel, we do not have this additional parity.
The idea of concatenated coding fits well with Shannon’s channel coding theorem, stating
that as long as we stay on the right side of the channel capacity we can correct everything -
if the code is long enough. This also means that if the code is very long, it does not have to
be optimal. The length in itself gives good error correcting capabilities, and concatenated
coding is just a way of constructing - and especially decoding - very long codes.
2. Turbo Codes
2.1 Encoding
We can regard the turbo code as a large block code. The performance depends on the
weight distribution - not only the minimum distance but also the number of words with low
weight. Therefore, we want input patterns giving low weight words from the first encoder
to be interleaved to patterns giving words with high weight for the second encoder.
Convolutional codes have usually been encoded in their feed-forward form, like
(G1, G2) = (1+D^2, 1+D+D^2). However, for these codes a single 1, i.e. the sequence
...0001000..., will give a codeword which is exactly the generator vectors, and the weight of
this codeword will in general be very low. It is clear that a single 1 will propagate through
any interleaver as a single 1, so the conclusion is that if we use the codes in the feed-forward
form in the turbo scheme the resulting code will have a large number of codewords
with very low weight.
The trick is to use the codes in their recursive systematic form where we divide with one of
the generator vectors. Our example gives (1, G2/G1) = (1, (1+D+D^2)/(1+D^2)). This operation
does not change the set of encoded sequences, but the mapping of input sequences to
output sequences is different. We say that the code is the same, meaning that the distance
properties are unchanged, but the encoding is different.
In Figure 3 we have shown an encoder on the recursive systematic form. The output
sequence we got from the feed-forward encoder with a single 1 is now obtained with the
input 1+D^2 = G1. More important is the fact that a single 1 gives a codeword of semi-infinite
weight, so with the recursive systematic encoders we may have a chance to find an
interleaver where information patterns giving low weight words from the first encoder are
interleaved to patterns giving words with high weight from the second encoder. The most
critical input patterns are now patterns of weight 2. For the example code the information
sequence ...01010... will give an output of weight 5.
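To make this concrete, here is a minimal Python sketch of the recursive systematic encoder (1, (1+D+D^2)/(1+D^2)); the register layout (feedback from the second delay element, feedforward taps on all three positions) is our reading of the polynomials rather than a copy of Figure 3. Feeding it the critical weight-2 input reproduces the weight-5 codeword mentioned above.

```python
def rsc_encode(bits):
    """Recursive systematic encoder for (1, (1+D+D^2)/(1+D^2)).

    s1 and s2 are the two delay elements, s1 holding the most recent value.
    Returns the systematic and parity bit streams.
    """
    s1 = s2 = 0
    systematic, parity = [], []
    for d in bits:
        a = d ^ s2           # feedback: division by G1 = 1 + D^2
        p = a ^ s1 ^ s2      # feedforward: multiplication by G2 = 1 + D + D^2
        systematic.append(d)
        parity.append(p)
        s1, s2 = a, s1       # shift the register
    return systematic, parity

# The critical weight-2 input 1 + D^2 (two 1s separated by a single 0):
sys_bits, par_bits = rsc_encode([1, 0, 1, 0, 0, 0])
print(sum(sys_bits) + sum(par_bits))   # total codeword weight: 5
```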
One thing is important concerning the systematic property, though. If we transmit the sys-
tematic part from both encoders, this would just be a repetition, and we know that we can
construct better codes than repetition codes. The information part should only be transmit-
ted from one of the constituent codes, so if we use constituent codes with rate 1/2 the final
rate of the turbo code becomes 1/3. If more redundancy is needed, we must select constitu-
ent codes with lower rates. Likewise we can use puncturing after the constituent encoders
to increase the rate of the turbo codes.
Now comes the question of the interleaving. A first choice would be a simple block
interleaver, i.e. to write by row and read by column. However, two input words of low
weight would give some very unfortunate patterns in this interleaver. The pattern is shown
in Figure 4 for our example code. We see that this is exactly two times the critical two-
input word for the horizontal encoder and two times the critical two-input pattern for the
vertical encoder as well. The result is a code word of low weight (16 for the example code)
- not the lowest possible, but since the pattern appears at every position in the interleaver
we would have a large number of these words.
This time the trick is to use a pseudo-random interleaver, i.e. to read the information bits to
the second encoder in a random (but fixed) order. The pattern from Figure 4 may still
appear, but not nearly as often. On the other hand we now have the possibility that a critical
two-input pattern is interleaved to another critical two-input pattern.
. . . 0 0 0 0 0 . . .
. . . 0 1 0 1 0 . . .
. . . 0 0 0 0 0 . . .
. . . 0 1 0 1 0 . . .
. . . 0 0 0 0 0 . . .
Figure 4 Critical pattern in block interleaver
It is possible to find interleavers that are slightly better than the pseudo-random ones; some
papers on this topic are included in the literature list.
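As a small illustration, a pseudo-random interleaver can simply be a permutation of the bit positions that is drawn once and then kept fixed; the helper names and the seed below are arbitrary choices for the sketch.

```python
import random

def make_interleaver(length, seed=1234):
    """A fixed pseudo-random interleaver: one permutation of the bit positions."""
    order = list(range(length))
    random.Random(seed).shuffle(order)
    return order

def interleave(values, order):
    """Read the information bits to the second encoder in the permuted order."""
    return [values[k] for k in order]

def deinterleave(values, order):
    """Restore the original order, e.g. for extrinsic information in the decoder."""
    out = [0] * len(order)
    for k, p in enumerate(order):
        out[p] = values[k]
    return out
```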
We will end this section by showing a more detailed drawing of a turbo encoder, Figure 5.
Here we see the two recursive systematic encoders, this time for the code
(1, (1+D^4)/(1+D+D^2+D^3+D^4)). Notice that the systematic bit is removed from one of them.
At the input of the constituent encoders we see a switch. This is used to force the encoders
to the all-zero state - i.e. to terminate the trellis. The complete incoming frame is kept in a
buffer from where it is read out with two different sets of addresses - one for the original
sequence and one for the interleaved one. This way output 1 and output 2 correspond to the
same frame and can be merged before transmission.
Figure 5 Turbo encoder example
2.2 First Decoding
Decoding of error correcting codes is basically a comparison of the probabilities for different
codewords - or with convolutional codes, different paths in the trellis. When we talk
about probabilities, it is always the probability of some event given a certain amount of
information about this event. This is especially clear when we talk about probabilities of
something that has already happened - which is always the case in coding theory. What we
mean when we talk about the probability that x was sent, p(x), is the probability that x was
sent given the amount of information that we have about the event. Usually that is only the
received noisy version of x - and of course knowledge of the coding scheme, transmission
link etc.
In some cases we have some knowledge of the transmitted signal - before we decode the
received one. That may be information that some messages are more likely to occur than
others or information from other transmitted sequences. We call this information a priori
information and have the corresponding a priori probabilities. Similarly, we talk about a
posteriori probabilities when we have included both the a priori information and the
information gained by the decoding.
For turbo codes we have two encoded sequences. Clearly we must start by decoding one of
them to get a first estimate of the information sequence. This estimate should then be used
as a priori information in the decoding of the second encoded sequence. This requires that
the decoder is able to use a soft decision input and to produce some kind of soft output.
The decoding is sketched in Figure 6.
The standard decoder for turbo codes is the A Posteriori Probability (APP) decoding
algorithm, sometimes referred to as the Maximum A Posteriori (MAP) decoding algorithm.
The APP decoder, described in Section 3, does indeed calculate the a posteriori probabilities
for each information bit.
We will represent the soft input/output as log-likelihood ratios, i.e. a signed number where
negative numbers indicate that zero is the most likely value of the bit. As seen from For-
mula 1 the log-likelihood ratio of the a posteriori probabilities can easily be divided into
two components - the log-likelihood ratio of the a priori probabilities of the bit d_t and the
information gained by the current observation. This means that when we gain additional
information about the information bits - like with the second decoding - we simply add a
(negative or positive) component to the log-likelihood ratio.
\[
\begin{aligned}
\Lambda(d_t) &= \log\frac{\Pr\{d_t=1 \mid \text{observation}\}}{\Pr\{d_t=0 \mid \text{observation}\}} \\
&= \log\frac{\Pr_{ap}\{d_t=1\}}{\Pr_{ap}\{d_t=0\}}
  + \log\frac{\Pr\{\text{observation} \mid d_t=1\}}{\Pr\{\text{observation} \mid d_t=0\}}
\end{aligned}
\tag{1}
\]
with
\[
\Lambda'(d_t) = \log\frac{\Pr_{ap}\{d_t=1\}}{\Pr_{ap}\{d_t=0\}}
\]
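As a small numeric illustration of Formula 1 (the probabilities below are made up), the a posteriori log-likelihood ratio is simply the a priori component plus the component gained from the observation:

```python
import math

def llr(p1, p0):
    return math.log(p1 / p0)

apriori = llr(0.6, 0.4)            # the bit is believed to be 1 with probability 0.6
channel = llr(0.9, 0.2)            # Pr{observation | d_t = 1} and Pr{observation | d_t = 0}
aposteriori = apriori + channel    # Formula 1: the two components simply add
print(aposteriori)                 # positive, so 1 is the more likely value of d_t
```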
2.3 Putting Turbo on the Turbo Codes.
When we have a parity equation, it involves a number of information bits. Let us look at
one of the simplest possible parity equations - a sum of two information bits: P = I1 + I2. It is
clear that if both P and I2 are very reliable we get a reliable estimate of I1; on the other
hand, if I2 is very unreliable we do not get much information about I1. If we now imagine
that both I1 and I2 are unreliable when we decoded the first sequence, but that I2 is in-
volved in some parity equations with very reliable bits in the second encoded sequence -
then we might return to the parity equations from the first sequence for a second iteration
with this new and much more reliable estimate of I2. This way we could continue to
decode the two encoded sequences and iterate towards the final decision.
However, it is not that easy since we must be very careful not to use our information
more than once. Luckily we see from Formula 1, that it is easy to subtract the a priori
information - which came from the other decoder - from the decoder output. This will
prevent most of the unwanted positive feed-back. We may still have loops in the decision
process, though, i.e. we might see that I1 influences I2 in the first decoder, that I2
influences I3 in the second decoder and finally that I3 influences I1 in the next iteration in
the first decoder. This way the new improved estimate of I1 will be based on information
that came from I1 in the first place.
Use of the system in practice has shown that if we subtract the log-likelihood ratio of the a
priori information after each constituent decoder and make a number of decoding iterations
we get a system that is working remarkably well - for many applications it actually
outperforms the previously known systems. Still, we must conclude that the final result
after turbo decoding is a sub-optimal decoding due to the loops in the decision process. For
low signal-to-noise ratios we may even see that the decoding does not converge to anything
close to the transmitted codeword.
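A minimal sketch of this iteration, matching the decoder structure of Figure 7, is given below. It assumes an app_decode(sys_llr, par_llr, apriori_llr) routine that returns a posteriori LLRs for the information bits (Section 3 describes such a decoder); the name, signature and helper details are illustrative only.

```python
def turbo_decode(llr_sys, llr_par1, llr_par2, order, app_decode, iterations=8):
    """Iterative decoding: two constituent APP decoders exchanging extrinsic
    information. 'order' is the interleaver permutation."""
    n = len(llr_sys)
    apriori1 = [0.0] * n                       # nothing is known before the first pass
    for _ in range(iterations):
        # First decoder works in the original bit order.
        apost1 = app_decode(llr_sys, llr_par1, apriori1)
        # Keep only the new (extrinsic) part: subtract the a priori term (Formula 1).
        extrinsic1 = [apost1[k] - apriori1[k] for k in range(n)]
        # Second decoder works in the interleaved bit order.
        apost2 = app_decode([llr_sys[p] for p in order], llr_par2,
                            [extrinsic1[p] for p in order])
        extrinsic2 = [apost2[k] - extrinsic1[order[k]] for k in range(n)]
        # De-interleave and use as a priori information in the next iteration.
        for k in range(n):
            apriori1[order[k]] = extrinsic2[k]
    # Hard decisions from the last a posteriori LLRs (de-interleaved).
    bits = [0] * n
    for k in range(n):
        bits[order[k]] = 1 if apost2[k] > 0 else 0
    return bits
```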
Figure 7 Turbo decoder
2.4 Performance Example
We will show an example of the performance with turbo codes. We use the system
illustrated in Figure 5, i.e. the code (1, (1+D^4)/(1+D+D^2+D^3+D^4)) for both encoders, but the
information sequence is only transmitted from the first one. This means that the over-all rate
is 1/3. The block length is 10384 bits and we use a pseudo-random interleaver. After each
frame the encoders are forced to the zero state. The corresponding termination tail - 4 in-
formation bits and 4 parity bits for each encoder, a total of 16 bits - is appended to the
transmitted frame and used in the decoder. In principle the termination reduces the rate, but
for large frames this has no practical influence. In this case the rate is reduced from 0.3333
to 0.3332.
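As a quick check of these numbers, the rate with the tail included is R = 10384 / (3 · 10384 + 16) = 10384 / 31168 ≈ 0.3332.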
The performance curves for Bit Error Rate (BER) and Frame Error Rate (FER) are shown
in Figure 8. Due to the sub-optimal decoding the performance curves consist of two parts.
For low signal-to-noise ratios the main problem is lack of convergence in the iterated de-
coding process, resulting in frames with a large number of errors. In this region we are far
from optimal decoding. This means that we may benefit from more iterations. As we see
from the figure there is a considerable gain by going from 8 to 18 iterations, and with more
iterations the performance might be even better.
For high signal-to-noise ratios the decoding is almost optimal, and the main problem is
codewords of low weight. This region is usually referred to as the error-floor since the
improvement for increasing signal-to-noise ratio is very small. In spite of the name it is not
a true floor, since the BER and FER is constantly decreasing - although not nearly as fast as
for the low signal-to-noise ratios. Notice that when the signal to noise ratio is high a small
number of iterations is sufficient.
3. APP Decoding
The A Posteriori Probability (APP) algorithm does in fact calculate the a posteriori prob-
abilities of the transmitted information bits for a convolutional code. In this presentation
we will restrict ourselves to convolutional codes with rate 1/n.
The convolutional encoder with memory M (Figure 3) may be seen as a Markov source
with 2^M states S_t, input d_t and output X_t. The output X_t and the new state S_t are functions
of the input d_t and the previous state S_{t-1}.
If the output X_t is transmitted through a Discrete Memoryless Channel with white Gaussian
noise, the probability of receiving Y_t when X_t was sent is
\[
\Pr\{Y_t \mid X_t\} = \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(y_t^j - x_t^j)^2}{2\sigma^2}}
\tag{2}
\]
where x_t^j is the j-th bit of the transmitted word X_t, and y_t^j the corresponding received
value. The signal-to-noise ratio is E_s/N_0 = 1/(2σ²). In principle knowledge of the signal-to-noise
ratio is needed for the APP algorithm. However, it may be set to a fixed value -
depending on the operation point of the system - with only a small degradation of the
performance.
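As an illustration of Formula 2, the branch probability can be computed as below; mapping the code bits x_t^j to the symbol values ±1 is an assumption of the sketch, since the choice of modulation is not spelled out here.

```python
import math

def channel_prob(received, symbols, sigma):
    """Pr{Y_t | X_t} for the white Gaussian noise channel, Formula (2).
    'received' holds the n received values y_t^j, 'symbols' the transmitted
    values x_t^j (code bits mapped to +/-1 by assumption)."""
    p = 1.0
    for y, x in zip(received, symbols):
        p *= math.exp(-(y - x) ** 2 / (2.0 * sigma ** 2)) / (math.sqrt(2.0 * math.pi) * sigma)
    return p
```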
Assume that we receive the sequence Y_1^L = Y_1, Y_2, ..., Y_L. The a posteriori probabilities of
the state transitions (i.e. branches) are found as
\[
\Pr\{S_{t-1}=m',\, S_t=m \mid Y_1^L\}
= \frac{\Pr\{S_{t-1}=m',\, S_t=m,\, Y_1^L\}}{\Pr\{Y_1^L\}},
\qquad t = 1 \ldots L
\tag{3}
\]
Pr{Y_1^L} is a constant for a given received sequence, and since we consider rate 1/n
codes only, there is one specific information bit associated with each state transition. We
therefore define
\[
\sigma_t(i, m') = \Pr\{d_t=i,\, S_{t-1}=m',\, Y_1^L\}
\tag{4}
\]
\[
\Lambda(d_t) = \log\frac{\Pr\{d_t=1 \mid \text{observation}\}}{\Pr\{d_t=0 \mid \text{observation}\}}
= \log\frac{\sum_{m'}\sigma_t(1,m')}{\sum_{m'}\sigma_t(0,m')}
\tag{5}
\]
In order to calculate σ_t(i,m') we define the following probability functions α_t(m), β_t(m),
and γ_t(i,m') as
\[
\alpha_t(m) = \Pr\{S_t = m,\, Y_1^t\}
\tag{6}
\]
\[
\beta_t(m) = \Pr\{Y_{t+1}^L \mid S_t = m\}
\tag{7}
\]
\[
\gamma_t(i, m') = \Pr\{d_t = i,\, Y_t \mid S_{t-1} = m'\}
\tag{8}
\]
Compared to the Viterbi algorithm, α_t(m) corresponds to the state metrics, while γ_t(i,m')
corresponds to the branch metrics. β_t(m) can be seen as backwards state metrics.
For the notation we will also need the function giving the new encoder state S_t when
S_{t-1} = m' and d_t = i,
newstate(i, m')
and the function giving the old encoder state S_{t-1} when S_t = m and d_t = i,
oldstate(i, m)
Since the encoder is a Markov process and the channel is memoryless, we have
\[
\Pr\{Y_{t+1}^L \mid S_t = m,\, Y_1^t\} = \Pr\{Y_{t+1}^L \mid S_t = m\}
\tag{11}
\]
and
\[
\begin{aligned}
\sigma_t(i, m') &= \Pr\{S_{t-1}=m',\, Y_1^{t-1}\}\,\Pr\{d_t=i,\, Y_t \mid S_{t-1}=m'\}\,
\Pr\{Y_{t+1}^L \mid S_t=\mathrm{newstate}(i,m')\} \\
&= \alpha_{t-1}(m')\,\gamma_t(i,m')\,\beta_t(\mathrm{newstate}(i,m'))
\end{aligned}
\tag{12}
\]
If we assume that the frames are terminated to state 0, we have α_0(0) = 1 and α_0(m) = 0,
m = 1, 2, ..., 2^M - 1. We can calculate α as a forward recursion
\[
\begin{aligned}
\alpha_t(m) &= \sum_{i=0,1} \Pr\{d_t=i,\, S_{t-1}=\mathrm{oldstate}(i,m),\, Y_1^t\} \\
&= \sum_{i=0,1} \Pr\{S_{t-1}=\mathrm{oldstate}(i,m),\, Y_1^{t-1}\}\,
\Pr\{d_t=i,\, Y_t \mid S_{t-1}=\mathrm{oldstate}(i,m)\} \\
&= \sum_{i=0,1} \alpha_{t-1}(\mathrm{oldstate}(i,m))\,\gamma_t(i,\mathrm{oldstate}(i,m))
\end{aligned}
\tag{13}
\]
At the end of the frame we have β_L(0) = 1 and β_L(m) = 0, m = 1, 2, ..., 2^M - 1. We can
calculate β as a backward recursion
\[
\begin{aligned}
\beta_t(m) &= \sum_{i=0,1} \Pr\{d_{t+1}=i,\, Y_{t+1}^L \mid S_t=m\} \\
&= \sum_{i=0,1} \Pr\{d_{t+1}=i,\, Y_{t+1} \mid S_t=m\}\,
\Pr\{Y_{t+2}^L \mid S_{t+1}=\mathrm{newstate}(i,m)\} \\
&= \sum_{i=0,1} \gamma_{t+1}(i,m)\,\beta_{t+1}(\mathrm{newstate}(i,m))
\end{aligned}
\tag{14}
\]
If the frames are not terminated we have no knowledge of the initial and final states. In
this case we must use α_0(m) = β_L(m) = 2^{-M}.
Since α_t(m) = Pr{S_t = m, Y_1^t} becomes very small with increasing t, some rescaling must
be applied:
\[
\alpha'_t(m) = \Pr\{S_t=m \mid Y_1^t\} = \frac{\Pr\{S_t=m,\, Y_1^t\}}{\Pr\{Y_1^t\}}
\tag{15}
\]
where Pr{Y_1^t} is found as the sum of α_t(m) over all states, meaning that the α'_t(m) values
always add up to one. However, since the output is the log-likelihood ratio the actual
rescaling is not important as long as underflows are avoided. Similarly, the function β_t(m)
needs rescaling.
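Gathering the pieces, a minimal Python sketch of the two recursions and the ratio of Formula 5 is given below. It assumes terminated frames, takes the branch metrics γ_t(i, m') as a precomputed table and the trellis functions newstate/oldstate as arguments, and rescales α and β as discussed above; the function name and data layout are illustrative only.

```python
import math

def app_decode(gammas, num_states, newstate, oldstate):
    """Forward/backward recursions (Formulas 13 and 14) for a terminated frame.

    gammas[t][(i, m)] holds gamma_{t+1}(i, m) = Pr{d=i, Y | S_prev=m} for t = 0..L-1.
    Returns the log-likelihood ratio of each information bit (Formulas 5 and 12).
    """
    L = len(gammas)
    # Forward recursion, alpha_0(0) = 1, rescaled at every step (Formula 15).
    alpha = [[0.0] * num_states for _ in range(L + 1)]
    alpha[0][0] = 1.0
    for t in range(1, L + 1):
        for m in range(num_states):
            alpha[t][m] = sum(alpha[t - 1][oldstate(i, m)] *
                              gammas[t - 1][(i, oldstate(i, m))] for i in (0, 1))
        scale = sum(alpha[t]) or 1.0
        alpha[t] = [a / scale for a in alpha[t]]
    # Backward recursion, beta_L(0) = 1, rescaled the same way.
    beta = [[0.0] * num_states for _ in range(L + 1)]
    beta[L][0] = 1.0
    for t in range(L - 1, -1, -1):
        for m in range(num_states):
            beta[t][m] = sum(gammas[t][(i, m)] *
                             beta[t + 1][newstate(i, m)] for i in (0, 1))
        scale = sum(beta[t]) or 1.0
        beta[t] = [b / scale for b in beta[t]]
    # Combine alpha, gamma and beta (Formula 12) and form the ratio (Formula 5).
    llrs = []
    for t in range(1, L + 1):
        num = sum(alpha[t - 1][m] * gammas[t - 1][(1, m)] * beta[t][newstate(1, m)]
                  for m in range(num_states))
        den = sum(alpha[t - 1][m] * gammas[t - 1][(0, m)] * beta[t][newstate(0, m)]
                  for m in range(num_states))
        llrs.append(math.log(num / den))
    return llrs
```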
The algorithm sketched here requires that α_t(m) is stored for the complete frame since we
have to await the end of the frame before we can calculate β_t(m). We can instead use a
sliding window approach with period T and training period Tr. First α_t(m) is calculated and
stored for t = 0 to T-1. The calculation of β_t(m) is initiated at time t = T+Tr-1 with initial
conditions β_{T+Tr-1}(m) = 2^{-M}. The first Tr values of β_t(m) are discarded, but after the
training period, i.e. for t = T-1 down to 0, we assume that β_t(m) is correct and ready for the
calculation of σ_t(i,m'). After the first window we continue with the next one until we reach
the end of the frame, where we use the true final conditions for β_L(m).
Of course, this approach is an approximation, but if the training period is carefully chosen
the performance degradation can be very small.
Since we have only one output associated with each transition, we can calculate γ_t(i,m') as
\[
\gamma_t(i, m') = \Pr_{apriori}\{d_t = i\}\, \Pr\{Y_t \mid d_t = i,\, S_{t-1} = m'\}
\tag{16}
\]
For turbo codes the a priori information typically arrives as a log-likelihood ratio. Luckily
we see from the calculation of α_t(m) and β_t(m) that γ_t(i,m') is always used in pairs -
γ_t(0,m') and γ_t(1,m'). This means we can multiply γ_t(i,m') with a constant
\[
k_t = \frac{1}{\Pr_{apriori}\{d_t = 0\}}
\tag{17}
\]
and get
\[
\gamma_t(1, m') = \frac{\Pr_{apriori}\{d_t=1\}}{\Pr_{apriori}\{d_t=0\}}\, \Pr\{Y_t \mid d_t=1,\, S_{t-1}=m'\}
\tag{18}
\]
\[
\gamma_t(0, m') = \Pr\{Y_t \mid d_t=0,\, S_{t-1}=m'\}
\tag{19}
\]
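In a turbo decoder the a priori information therefore enters as a single factor exp(Λ') on the d_t = 1 branch, as in this small sketch of Formulas 18 and 19 (the function name is illustrative):

```python
import math

def scaled_gammas(ch_prob0, ch_prob1, apriori_llr):
    """Branch metric pair of Formulas (18) and (19).
    ch_prob0 and ch_prob1 are Pr{Y_t | d_t=0, S_{t-1}=m'} and Pr{Y_t | d_t=1, S_{t-1}=m'};
    the a priori ratio Pr{d_t=1}/Pr{d_t=0} equals exp(apriori_llr)."""
    return ch_prob0, math.exp(apriori_llr) * ch_prob1
```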
For an actual implementation the values of α_t(m), β_t(m) and γ_t(i,m') may be represented as
the negative logarithm of the actual probabilities. This is also common practice for Viterbi
decoders, where the branch and state metrics are -log of the corresponding probabilities.
With the logarithmic representation multiplication becomes addition and addition becomes
an E-operation, where
\[
x \,E\, y = -\log(e^{-x} + e^{-y}) = \min(x, y) - \log(1 + e^{-|y - x|})
\tag{20}
\]
This function can be reduced to finding the minimum and adding a small correction factor.
As seen from Formula 18 the incoming log-likelihood ratio Λ' can be used directly in the
calculation of -log(γ) as the log-likelihood ratio of the a priori probabilities.
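A direct transcription of the E-operation, using the min-plus-correction form of Formula 20, could look like:

```python
import math

def e_op(x, y):
    """x E y = -log(exp(-x) + exp(-y)), i.e. the minimum plus a small correction (Formula 20)."""
    return min(x, y) - math.log1p(math.exp(-abs(y - x)))
```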
4. Final Remarks
This tutorial was meant as a first glimpse of turbo codes and the iterated decoding
principle. Hopefully, we have shed some light on the topic; if there are still some dark spots
- try reading it again!
Of course there are a lot of details not explained here, a lot of variations of the turbo coding
scheme and a lot of things that may need a proof. Some of these can be found in the papers
on the literature list.
5. Selected Literature
[1] Jakob Dahl Andersen, “Turbo Codes Extended with Outer BCH Code”, Electronics
Letters, Vol. 32, No. 22, Oct. 1996.
[2] J. Dahl Andersen and V. V. Zyablov, “Interleaver Design for Turbo Coding”,
Proc. Int. Symposium on Turbo Codes, Brest, Sept. 1997.
[3] Jakob Dahl Andersen, “Selection of Component Codes for Turbo Coding based
on Convergence Properties”, Annales des Télécommunications, Special issue on
iterated decoding, June 1999.
[8] C. Berrou and A. Glavieux, “Near Optimum Error Correcting Coding and Decoding:
Turbo Codes”, IEEE Trans. on Communications, Vol. 44, No. 10, Oct. 1996.
[9] R. J. McEliece, E. R. Rodemich and J.-F. Cheng, “The Turbo Decision Algorithm”,
Presented at the 33rd Allerton Conference on Communication, Control
and Computing, Oct. 1995.
[10] L. C. Perez, J. Seghers and D. J. Costello, Jr., “A Distance Spectrum Interpretation
of Turbo Codes”, IEEE Trans. on Inform. Theory, Vol. 42, No. 6, Nov. 1996.