
UNIVERSIDAD POLITECNICA DE MADRID

ESCUELA TECNICA SUPERIOR DE INGENIEROS DE TELECOMUNICACION

PROYECTO FIN DE CARRERA

TURBO DECODER IMPLEMENTATION BASED ON THE SOVA ALGORITHM

Carlos Arrabal Azzalini


Madrid, April 2007

PROYECTO FIN DE CARRERA

TURBO DECODER IMPLEMENTATION BASED ON THE SOVA ALGORITHM

Author:

Carlos Arrabal Azzalini

Tutor:

Pablo Ituero Herrero

DEPARTAMENTO DE INGENIERÍA ELECTRÓNICA
ESCUELA TÉCNICA SUPERIOR DE INGENIEROS DE TELECOMUNICACIÓN
UNIVERSIDAD POLITÉCNICA DE MADRID
Madrid, April 2007

PROYECTO FIN DE CARRERA:

Turbo Decoder Implementation Based on the SOVA Algorithm

AUTHOR: Carlos Arrabal Azzalini
TUTOR: Pablo Ituero Herrero

The examination committee appointed to judge the above Project, composed of the following members:

CHAIR: D. Carlos Alberto López Barrio
MEMBER: Dª. María Luisa López Vallejo
SECRETARY: D. José Luis Ayala Rodrigo
SUBSTITUTE: D. Gabriel Caffarena Fernández

agree to award it the grade of:

Madrid, ________ of ________, 2007

The Secretary of the Committee

To my parents

Acknowledgements
First of all I would like to thank Marisa for assigning this project and the scholarship to me. I have enjoyed working on it all along. I would like to give special thanks to my mentor and friend Pablo for his advice and support. I had a great time working with him. Thanks to my friends at the Lab for the fantastic environment. Finally, I would like to thank Sandra for all her support and patience and for being there all the time.


Abstract
Today's most common architectures for implementing the SOVA algorithm are affected by two parameters: the trace back depth and the reliability updating depth. These parameters play an important role in the BER performance, power consumption, area and system throughput trade-offs. In this work we present a new approach to SOVA decoding that is not limited by the mentioned parameters and leads to an optimum execution of the SOVA algorithm. Besides, the architecture is built from recursive units which consume less power, since the amount of employed registers is reduced. We also present a new scheme to improve the SOVA BER performance, based on an approximation to the BR-SOVA algorithm. With this scheme the BER achieved is within 0.1 dB of the one obtained with a Max-Log-MAP algorithm.


Contents

1 Introduction
2 Turbo Codes
  2.1 Binary Phase Shift Keying Communication System Model
  2.2 Soft Information and Log-Likelihood Ratios in Channel Coding
  2.3 Convolutional Encoders
  2.4 Trellis Diagrams
  2.5 Turbo Codes Encoders
  2.6 Trellis Termination
3 Decoding Turbo Codes: Soft Output Viterbi Algorithm
  3.1 Turbo Codes decoding process
  3.2 SISO Unit: SOVA
    3.2.1 Viterbi Algorithm Decoding Example
    3.2.2 Soft Output extension for the VA
    3.2.3 Improving the soft output information of the SOVA algorithm
4 Hardware Implementation of a Turbo Decoder based on SOVA
  4.1 Turbo Decoder RAM buffers
  4.2 Interleaving/Deinterleaving unit of the turbo decoder
  4.3 SOVA as the core of the SISO
  4.4 Branch Metric Unit
  4.5 Add Compare Select Unit
  4.6 Survival Memory Unit
    4.6.1 Register Exchange Survival Memory Unit
    4.6.2 Systolic Array Survival Memory Unit
    4.6.3 Two Step approach for the Survival Memory Unit
    4.6.4 Other Architectures
    4.6.5 Fusion Points Survival Memory Unit
  4.7 Fusion Points based Reliability Updating Unit
  4.8 Control Unit
  4.9 Improvements
5 Methodology
6 Measures and Results
  6.1 Quantization Scheme
  6.2 Synthesis Results
  6.3 Bit Error Rate Results
  6.4 Throughput Results
  6.5 Power Results
7 Conclusions and future work
Bibliography

List of Figures

2.1 Simplified communication system model
2.2 Discrete AWGN channel
2.3 NSC encoder of rate 1/2
2.4 RSC encoder of rate 1/2
2.5 RSC encoder used in the UMTS standard. P_fb = [1011], P_g = [1101]
2.6 Trellis example of an RSC encoder with P_fb = [111], P_g = [101]
2.7 Serial concatenated Turbo encoder
2.8 Parallel concatenated Turbo encoder. RSC encoder with P_fb = [111], P_g = [101]
2.9 Turbo encoder with trellis termination in one encoder. P_fb = [111], P_g = [101]
3.1 Turbo Decoder generic scheme
3.2 Output during state transition for a given trellis
3.3 Trellis diagram for VA, code given by P_fb = [111], P_g = [101]
3.4 Soft output extension example for the Viterbi Algorithm. Code given by P_fb = [111], P_g = [101]
4.1 Hardware implementation of a turbo decoder
4.2 Overall system states diagram
4.3 Data-in RAM
4.4 Data-out RAM
4.5 RAM La/Le and RAM Le/La connections
4.6 Interleaving/Deinterleaving Unit
4.7 Viterbi and SOVA decoder schemes
4.8 BMU for the RSC encoder
4.9 Add Compare Select Unit for the SOVA. P_fb = [111], P_g = [101]
4.10 Modular representation of the path metrics. Each path metric register has a width of nb bits
4.11 Merging of paths in the traceback
4.12 Register Exchange SMU for the SOVA. P_fb = [111], P_g = [101]
4.13 Register Exchange processing elements
4.14 Systolic Array for the Viterbi Algorithm
4.15 Survival unit for the Systolic Array
4.16 Two Step idea. First tracing back, and then reliability updating
4.17 Fusion Points based SMU
4.18 Possibility of fusion points
4.19 Fusion Point detection algorithm
4.20 Sequence of the Fusion Point algorithm
4.21 FPU architecture for a code with constraint length K = 3
4.22 Reliability updating problem
4.23 One possible solution to the problem of bit reliabilities releasing
4.24 Solution adopted for the bit reliabilities releasing problem
4.25 Fusion Points based Reliability updating unit
4.26 Recursive Updating Unit
4.27 Recursive Updating Process
4.28 Control Unit General Scheme
4.29 Control Unit State Diagram
4.30 Reliability Updating Unit with BR-SOVA approximation
4.31 Recursive Update with BR-SOVA approximation
5.1 Project Work Flow
5.2 Hardware-in-the-loop approach
5.3 Hardware-in-the-loop verification procedure
6.1 Quantization effect on the system BER performance. BR-SOVA approximation scheme. Simulation with quantization. MCF. P_fb = [111], P_g = [101]
6.2 HR-BRapprox comparison. Infinite precision simulations. MCF interleaver. P_fb = [111], P_g = [101]
6.3 HR-SOVA HIL results. MCF interleaver. P_fb = [111], P_g = [101]
6.4 BR-SOVA approximation HIL results. MCF interleaver. P_fb = [111], P_g = [101]
6.5 HR-BRapprox HIL comparison. MCF interleaver. P_fb = [111], P_g = [101]
6.6 HR-BRapprox comparison. Infinite precision simulations. RAND interleaver. P_fb = [1011], P_g = [1101]
6.7 BR-SOVA approximation HIL results. RAND interleaver. P_fb = [1011], P_g = [1101]
6.8 Throughput statistics. f = 25 MHz, f_RUU = 25 MHz. P_fb = [111], P_g = [101]
6.9 Throughput statistics. f = 25 MHz, f_RUU = 50 MHz. P_fb = [111], P_g = [101]
6.10 Throughput statistics. f = 16.66 MHz, f_RUU = 25 MHz. P_fb = [111], P_g = [101]
6.11 Throughput statistics. f = 25 MHz, f_RUU = 25 MHz. P_fb = [1011], P_g = [1101]
6.12 Throughput statistics. f = 25 MHz, f_RUU = 50 MHz. P_fb = [1011], P_g = [1101]
6.13 Throughput statistics. f = 16.66 MHz, f_RUU = 50 MHz. P_fb = [1011], P_g = [1101]

Chapter 1

Introduction
The goal of any communication system is to achieve highly reliable communications with a reduced transmitted power and to reach data rates as high as possible. All these parameters usually represent a trade-off that designers have to deal with. Bandwidth is also a limited resource in communication systems. Error-detecting and error-correcting techniques are used in digital communication systems in order to get higher spectral and power efficiencies. This is based on the fact that with these techniques more channel errors can be tolerated, and so the communication system can operate with a lower transmitted power, transmit over longer distances, tolerate more interference, use smaller antennas, and transmit at higher data rates.

One of the most widespread of these techniques is Forward Error Correction (FEC). On the transmitter side, an FEC encoder adds redundancy to the data in the form of parity information. Then, at the receiver, an FEC decoder is able to exploit the redundancy in such a way that a reasonable number of channel errors can be corrected. Claude Shannon, the father of Information Theory, showed that if long random codes are used, reliable communications can take place at the minimum required Signal to Noise Ratio (SNR). However, truly random codes are not practical to implement. Codes must possess some structure in order to have computationally tractable encoding and decoding algorithms.

Turbo Codes were introduced by Berrou, Glavieux and Thitimajshima in 1993 [3]. These codes exhibit an astonishing performance close to the theoretical Shannon limit, in addition to a good feasibility of VLSI (Very Large Scale Integration) implementation. Turbo Codes are used in the two most widely adopted third-generation cellular standards (UMTS and CDMA2000). They are also incorporated into standards used by NASA for deep space communications (CCSDS) and digital video broadcasting (DVB-T).

Decoding of Turbo Codes is carried out by a soft-output decoding algorithm: an algorithm that provides a measure of reliability for each bit that it decodes. Specifically, two of the component decoding algorithms used in Turbo Codes are known as MAP (Maximum a Posteriori) and SOVA (Soft Output Viterbi Algorithm). The high computational complexity of the MAP algorithm makes its implementation expensive and power-hungry. This is why most implementations perform a simplified version of the algorithm. The most common simplifications are the Log-MAP and Max-Log-MAP algorithms, which work in the logarithmic domain. Regardless, these algorithms are still more complex and power-hungry compared to the SOVA algorithm, which presents the drawback of a worse BER (Bit Error Rate) performance.


This work deals with a SOVA algorithm implementation. Today's most common architectures for implementing the SOVA algorithm are affected by two parameters: the trace back depth and the reliability updating depth. These parameters play an important role in the BER performance, power consumption, area and system throughput trade-offs. In this work we present a new approach to SOVA decoding that is not limited by the mentioned parameters and leads to an optimum execution of the SOVA algorithm. Besides, the architecture is built from recursive units which consume less power, since the amount of employed registers is reduced. We also present a new scheme to improve the SOVA BER performance. With this scheme the BER achieved is within 0.1 dB of the one obtained with the Max-Log-MAP algorithm.

The design was implemented on a low cost Spartan III FPGA (Field Programmable Gate Array). The system was tested for two major polynomials and the system BER was measured for input messages with different SNRs. Throughput measures were also taken, while power estimations were carried out by simulation. The key points of this work can be summarized in the following list:

- A complete Turbo Decoder implementation based on the SOVA algorithm has been achieved:
  - A two step approach for the SOVA decoding has been adopted [9].
  - A new algorithm that does not depend on the trace back depth of the survival path has been introduced for the SOVA decoding.
  - A new architecture for the previous algorithm has been designed.
  - A new architecture for updating bit reliabilities according to the HR-SOVA algorithm has been designed.
  - A novel updating process that approximates the BR-SOVA algorithm for binary RSC codes has been presented. With this scheme the BER performance is less than 0.1 dB from the Max-Log-MAP approach.
- The system has been described with generic VHDL code.
- The system has been extensively tested:
  - BER curves have been measured for the HR-SOVA and the BR-SOVA approximation with different codes (real system).
  - Throughput estimations have been obtained for different codes (real system).
  - Power estimations have been obtained with simulation tools (VHDL post-place-and-route model).

The structure of this document is the following. The second chapter introduces Turbo Codes and sets the environment where this work resides. The third chapter describes the SOVA algorithm in depth and sets the main ideas for the fourth chapter, which describes today's most common architectures and introduces the SOVA implementation proposed in this work. It is inside the fourth chapter where the new algorithm, in conjunction with the new architectures, is presented. The fifth chapter illustrates the practical design, from implementation to verification.

Finally, the sixth chapter presents the results and measures carried out on the real system, while the seventh chapter gives the conclusions and establishes the basis for future work.


Chapter 2

Turbo Codes
Turbo Codes were presented by Berrou, Glavieux and Thitimajshima [3] in 1993. They had a tremendous impact on the discipline of channel coding. They are, along with LDPC (Low Density Parity Check) codes, the closest approximation ever to the code that Claude Shannon proved to exist in the mid-20th century and which is able to achieve error free communications. Since their introduction, they have been intensively studied. The first commercial application was presented in 1997 [1] and today they are already part of the UMTS (Universal Mobile Telecommunication System) standards. They have become the first choice when working with low SNRs (Signal to Noise Ratios), such as in wireless applications and deep space communications.

In this chapter we first introduce the communication system model which has been employed in this work as the scenario for the channel coding tests. Next we introduce the concept of soft information, which is the key to Turbo Codes. We then describe Turbo Code encoders and finally we talk about trellis termination. The decoding process is left to the next chapter.

2.1 Binary Phase Shift Keying Communication System Model.

In order to explain the soft information concept and the log-likelihood ratio, we will develop a simplified communication model that will be the base example for the following concepts. This communication model is shown in Figure 2.1. On the transmitter side there is a source of information which we assume to provide equally likely symbols. There is a block for channel coding, which is the main subject of this work and is carried out by a Turbo Code. The modulation scheme is BPSK (Binary Phase-Shift Keying) and the channel is assumed to be AWGN (Additive White Gaussian Noise). On the receiver side, all the blocks complementary to those in the transmitter are found. Also, there is a matched filter which maximizes the SNR before sampling the received data. Note that we have omitted the synchronization recovery subsystem, which will be assumed to be ideal.

As a starting point, the source provides message bits $m_i$ at a rate of $1/T$ bits/sec, which are fed into the channel coding block. In a Turbo Code context, these bits are grouped to form a frame of size $L$ bits. The channel coding block outputs a coded frame of size $2L$. So, for each message bit $m_i$ there is a symbol made of two bits $x_i = \{x_{si}, x_{pi}\}$, and the code rate is $r = \frac{1}{2}$: one input bit, two output bits.
Figure 2.1: Simplified communication system model.


Figure 2.2: Discrete AWGN channel.

The modulator generates the waveform signals from the input bits and transmits them through the AWGN channel. The matched filter filters the received signals which, at the corresponding time instants, are sampled, and so the symbols $y_i$ are obtained. The AWGN channel, in conjunction with the matched filter and the sampling unit, can be modeled as a discrete AWGN channel, as shown in figure 2.2. The modeling of a discrete channel is desired, since computer simulations are simplified and the computing time is reduced. The equation that governs the behavior of this channel is the following:

$$y_i = a \sqrt{E_s}\,(2x_i - 1) + n_G \qquad (2.1)$$

where $a$ is a fading amplitude which is assumed to be 1. If a fading channel were under the scope of study, then $a$ would be assumed to be a random variable with a Rayleigh distribution. $E_s$ is the energy of the transmitted symbol and it relates to the energy per bit of information as $E_s = rE_b$. Finally, $n_G$ represents white Gaussian noise with zero mean and a power spectral density of $N_0/2$. For simulation purposes, equation 2.1 is rewritten as:

$$y_i = a\,(2x_i - 1) + n'_G \qquad (2.2)$$

where the variance of $n'_G$ becomes $\sigma^2 = \frac{N_0}{2E_s}$.
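As an illustration of equation 2.2, the following Python sketch models the discrete AWGN channel in software. It is only a simulation aid, not part of the hardware design; the code rate, Eb/N0 value, frame length and random seed are arbitrary choices.

```python
import numpy as np

def discrete_awgn_channel(x, ebn0_db, rate=0.5, a=1.0, rng=np.random.default_rng(0)):
    """Discrete AWGN channel of equation 2.2: y_i = a*(2x_i - 1) + n'_G,
    with noise variance sigma^2 = N0 / (2 Es) and Es = rate * Eb."""
    esn0 = rate * 10.0 ** (ebn0_db / 10.0)       # Es/N0 in linear scale
    sigma = np.sqrt(1.0 / (2.0 * esn0))          # sigma^2 = N0 / (2 Es)
    u = 2.0 * np.asarray(x, dtype=float) - 1.0   # BPSK mapping: 0 -> -1, 1 -> +1
    return a * u + sigma * rng.normal(size=u.shape)

# Example: transmit a short random coded frame at Eb/N0 = 2 dB
bits = np.random.default_rng(1).integers(0, 2, size=16)
y = discrete_awgn_channel(bits, ebn0_db=2.0)
```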


2.2 Soft Information and Log-Likelihood Ratios in Channel Coding.

Whenever a symbol $y_i$ is received at the decoder, the following test rule helps us to determine what the transmitted symbol was, based only on the observation $y_i$ and without the help of the code:

$$P(x_i = 1 \mid y_i) > P(x_i = 0 \mid y_i) \;\Rightarrow\; \hat{x}_i = 1$$
$$P(x_i = 1 \mid y_i) < P(x_i = 0 \mid y_i) \;\Rightarrow\; \hat{x}_i = 0$$

This rule is known as MAP (Maximum a Posteriori), since $P(x_i = 1 \mid y_i)$ and $P(x_i = 0 \mid y_i)$ are the a posteriori probabilities. Using Bayes' theorem, the previous rule can be rewritten as:

$$\frac{P(y_i \mid x_i = 1)\,P(x_i = 1)}{P(y_i)} > \frac{P(y_i \mid x_i = 0)\,P(x_i = 0)}{P(y_i)} \;\Rightarrow\; \hat{x}_i = 1$$
$$\frac{P(y_i \mid x_i = 1)\,P(x_i = 1)}{P(y_i)} < \frac{P(y_i \mid x_i = 0)\,P(x_i = 0)}{P(y_i)} \;\Rightarrow\; \hat{x}_i = 0$$

and rewriting the equations as ratios yields:

$$\frac{P(y_i \mid x_i = 1)}{P(y_i \mid x_i = 0)} \cdot \frac{P(x_i = 1)}{P(x_i = 0)} > 1 \;\Rightarrow\; \hat{x}_i = 1$$
$$\frac{P(y_i \mid x_i = 1)}{P(y_i \mid x_i = 0)} \cdot \frac{P(x_i = 1)}{P(x_i = 0)} < 1 \;\Rightarrow\; \hat{x}_i = 0$$

If we apply the natural logarithm to the previous equations, the testing result is not altered, and we obtain:

$$\ln\frac{P(y_i \mid x_i = 1)}{P(y_i \mid x_i = 0)} + \ln\frac{P(x_i = 1)}{P(x_i = 0)} > 0 \;\Rightarrow\; \hat{x}_i = 1$$
$$\ln\frac{P(y_i \mid x_i = 1)}{P(y_i \mid x_i = 0)} + \ln\frac{P(x_i = 1)}{P(x_i = 0)} < 0 \;\Rightarrow\; \hat{x}_i = 0$$

The previous ratios in the log domain are the LLR (Log-Likelihood Ratio) metrics, which are a useful way to represent the soft decision of receivers or decoders. We can summarize the previous steps with only one equation as follows:

$$L(x_i \mid y_i) = L(y_i \mid x_i) + L(x_i)$$

where $L(x_i \mid y_i) = \ln\frac{P(x_i = 1 \mid y_i)}{P(x_i = 0 \mid y_i)}$, $L(y_i \mid x_i) = \ln\frac{P(y_i \mid x_i = 1)}{P(y_i \mid x_i = 0)}$ and $L(x_i) = \ln\frac{P(x_i = 1)}{P(x_i = 0)}$. The notation of the previous equation is usually rewritten as:

$$\lambda_i = L_c(y_i) + L_{ai}$$

where $L_{ai}$ is the LLR of the a priori information and $L_c(y_i)$ is related to a measure of the channel reliability. Note that the sign of $\lambda_i$ indicates the hard decision.
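As a quick numerical check of the identity $L(x_i \mid y_i) = L(y_i \mid x_i) + L(x_i)$, the sketch below evaluates both sides with Gaussian channel likelihoods. It is an illustrative example only; the received sample, noise deviation and a priori probabilities are arbitrary values.

```python
import math

def gauss(y, u, sigma):
    """Gaussian likelihood p(y | u) for BPSK symbol u transmitted over AWGN."""
    return math.exp(-(y - u) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

y, sigma = 0.4, 0.8          # received sample and noise standard deviation
p1, p0 = 0.6, 0.4            # a priori probabilities P(x=1), P(x=0)

L_channel = math.log(gauss(y, +1, sigma) / gauss(y, -1, sigma))   # L(y | x)
L_apriori = math.log(p1 / p0)                                     # L(x)

# a posteriori LLR via Bayes' theorem: ln P(x=1|y) / P(x=0|y)
post1 = gauss(y, +1, sigma) * p1
post0 = gauss(y, -1, sigma) * p0
L_posterior = math.log(post1 / post0)

print(L_posterior, L_channel + L_apriori)   # the two values coincide
# the sign of the a posteriori LLR gives the hard decision
```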


Figure 2.3: NSC encoder of rate 1/2.

So far we have introduced the equations of soft information based on the received symbol at the input of the decoder, without the aid of the underlying code. The fact of using channel coding in the communication system lets us improve the LLR of the a posteriori probability. This is shown in [3]. The LLR of the a posteriori information at the output of the decoder is:

$$\Lambda_i = \lambda_i + L_{ei} = L_c(y_i) + L_{ai} + L_{ei} \qquad (2.3)$$

The term $L_{ei}$ is known as the extrinsic information, which actually is the improvement achieved by the decoder and the decoding process on the soft information. The extrinsic information will be the data fed as a priori information to the other decoder in a concatenated decoding scheme. It is important to remark that all terms in equation 2.3 can be added because they are statistically independent [3]. Statistical independence of the terms is essential to allow iterative decoding, and this is the reason for the interleavers in the concatenation schemes of Turbo encoders and Turbo decoders.

2.3 Convolutional Encoders.

Turbo Code encoders are mainly based on convolutional encoders. In these encoders the output signals are typically generated by convolving the input signal with itself in several different configurations, consequently adding redundancy to the code. Convolutional codes can be either Non-Systematic Convolutional (NSC) codes, when the input word is not among the outputs, or Recursive Systematic Convolutional (RSC) codes, when the input word is one of the outputs [8]. Figure 2.3 illustrates an example of an NSC encoder, while figure 2.4 shows an RSC encoder. A set of registers and modulo-two adders can be appreciated in the figures. The connections among those registers and the modulo-two adders determine the output sequence of the encoder. Dividing the number of inputs $I$ by the number of outputs $O$ gives the code rate $I/O$. The examples cited throughout this work will always use an RSC encoder with rate 1/2. To define a convolutional encoder we need a set of polynomials which represent the connections among the registers and the modulo-two adders. For an NSC encoder, two code generator polynomials define the encoder of rate 1/2 (see figure 2.3). On the other side, an RSC encoder is defined by both feedback and generator polynomials (see figure 2.4).

Figure 2.4: RSC encoder of rate 1/2.

Figure 2.5: RSC encoder used in the UMTS standard. P_fb = [1011], P_g = [1101].

The status of the set of registers represents the state of the encoder. Input bits $m_i$ make the encoder memory elements change and move into another state while producing the output bits $x_{si}$, $x_{pi}$ in the case of the RSC encoder. Convolutional encoders are characterized by the constraint length $K$. An encoder with constraint length $K$ has $K-1$ memory elements, which allows the encoder to move through $2^{K-1}$ states. RSC encoders are mostly used in Turbo Code schemes rather than NSC encoders, since better BER performance has been achieved with them. For instance, the encoder used in UMTS is the one depicted in Figure 2.5.
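As an illustration of the encoders just described, a minimal software sketch of a rate-1/2 RSC encoder is given below. It is a behavioural model only, not the hardware description; the tap-ordering convention, with the current-input coefficient first in each polynomial, is an assumption and may differ from the schematic of figure 2.4.

```python
def rsc_encode(bits, p_fb=(1, 1, 1), p_g=(1, 0, 1)):
    """Rate-1/2 RSC encoder sketch. p_fb and p_g are the feedback and generator
    polynomials (current-input tap first). Returns (systematic, parity) lists."""
    K = len(p_fb)                  # constraint length
    regs = [0] * (K - 1)           # K-1 memory elements, regs[0] is the newest
    systematic, parity = [], []
    for m in bits:
        a = m                      # recursive (feedback) bit
        for coeff, r in zip(p_fb[1:], regs):
            a ^= coeff & r
        p = p_g[0] & a             # parity: generator taps applied to [a, regs]
        for coeff, r in zip(p_g[1:], regs):
            p ^= coeff & r
        systematic.append(m)       # systematic output equals the input bit
        parity.append(p)
        regs = [a] + regs[:-1]     # shift the new feedback bit into the registers
    return systematic, parity

# Example: encode a short message with P_fb = [111], P_g = [101]
print(rsc_encode([1, 1, 0, 1]))
```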

2.4 Trellis Diagrams.

A trellis diagram is a graphical representation of the states of the encoder. It is a powerful tool since it not only allows us to see the state transitions, but also their time evolution. The MAP (Maximum a Posteriori Probability) and the SOVA (Soft Output Viterbi Algorithm) algorithms are used to decode Turbo Codes. They base their calculations on the trellis branches in order to reduce computation, and this is the reason why we explain trellis diagrams.

Figure 2.6: Trellis example of an RSC encoder with $P_{fb} = [111]$, $P_g = [101]$.

Figure 2.6 shows the trellis for the RSC encoder of figure 2.4. The figure also shows an example of an input message and how this input message represents a path in the trellis diagram. This path is colored in blue and is known as the state sequence $s$. In order to find the trellis representation of an encoder we follow these steps:

- The trellis will have $2^{K-1}$ states at each time instant.
- The memory elements of the encoder are set to represent a given state. Usually the first state is 0. Then we want to calculate the connections between the present state and the subsequent states.
- An input bit $m_i$ equal to zero is assumed. Then the output symbol is calculated by operating with the adders and the values of the registers. Also, the next state is calculated by shifting the register inputs at the clock edge. For example, in figure 2.6 we see that at state $s_0$ an input message bit $m_i = 0$ produces a transition to state $s_0$, while a bit $m_i = 1$ produces a transition to state $s_2$.
- An input bit $m_i$ equal to one is assumed. Again, the output symbol is calculated by operating with the adders and the values of the registers, and the next state is calculated by shifting the register inputs at the clock edge. Note that whenever a transition is due to a zero input bit, that transition is drawn as a solid line; whenever the transition is due to a one input bit, it is drawn as a dashed line.
- Repeat the previous steps for the rest of the states, $s_1$ to $s_3$ in the example.

The trellis diagram is given by the polynomials and therefore it is the same for all the stages. The encoded message can be thought of as a particular path within the trellis diagram, as shown in the example of figure 2.6.
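The transition table that a trellis diagram represents can also be enumerated programmatically. The sketch below is a minimal illustration; the mapping between register contents and the state labels $s_0$ to $s_3$ of figure 2.6 is one possible convention and may differ from the figure.

```python
def build_trellis(p_fb=(1, 1, 1), p_g=(1, 0, 1)):
    """Enumerate (next_state, (x_s, x_p)) for every (state, input_bit) of a
    rate-1/2 RSC code with constraint length K = len(p_fb)."""
    K = len(p_fb)
    transitions = {}
    for state in range(2 ** (K - 1)):
        regs = [(state >> i) & 1 for i in range(K - 1)]   # regs[0] is the newest register
        for m in (0, 1):
            a = m                                         # feedback (recursive) bit
            for coeff, r in zip(p_fb[1:], regs):
                a ^= coeff & r
            parity = p_g[0] & a                           # generator taps on [a, regs]
            for coeff, r in zip(p_g[1:], regs):
                parity ^= coeff & r
            new_regs = [a] + regs[:-1]
            next_state = sum(b << i for i, b in enumerate(new_regs))
            transitions[(state, m)] = (next_state, (m, parity))
    return transitions

for (state, m), (nxt, out) in sorted(build_trellis().items()):
    print(f"state {state}, m={m} -> state {nxt}, output {out}")
```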


Figure 2.7: Serial concatenated Turbo encoder

Figure 2.8: Parallel concatenated Turbo encoder. RSC encoders with P_fb = [111], P_g = [101].

2.5 Turbo Codes Encoders.

As we mentioned in section 2.3, Turbo Code encoders are mainly based on convolutional encoders. However, Turbo encoders also include one or more interleavers for shuffling data. Figure 2.7 shows a serial concatenated Turbo encoder, while figure 2.8 shows a parallel concatenated Turbo encoder of rate 1/2, which is the one used in our communication system model. Many combinations can be achieved by concatenating different convolutional encoders with interleavers. The purpose of the interleavers is to decorrelate the data streams, so that iterative decoding can take place at the decoder. In figure 2.8 there is a block known as the puncturer, which basically composes the parity bit of the resulting encoder by selecting one parity bit from each convolutional encoder at a time. If no puncturing were done, then the rate of the entire Turbo encoder would be 1/3; the rate of the resulting Turbo encoder can thus be different from the rate of the convolutional encoders.
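A behavioural sketch of the parallel concatenation of figure 2.8 is given below. It is illustrative only: the interleaver permutation and the alternating puncturing pattern are assumptions chosen to keep the overall rate at 1/2, and the RSC step follows the same convention as the earlier encoder sketch.

```python
import random

def rsc_parity(bits, p_fb=(1, 1, 1), p_g=(1, 0, 1)):
    """Rate-1/2 RSC encoder; returns only the parity stream."""
    regs = [0] * (len(p_fb) - 1)
    parity = []
    for m in bits:
        a = m
        for c, r in zip(p_fb[1:], regs):
            a ^= c & r
        p = p_g[0] & a
        for c, r in zip(p_g[1:], regs):
            p ^= c & r
        parity.append(p)
        regs = [a] + regs[:-1]
    return parity

def turbo_encode(bits, interleaver):
    """Parallel concatenated turbo encoder of rate 1/2 with alternating puncturing."""
    parity1 = rsc_parity(bits)
    parity2 = rsc_parity([bits[j] for j in interleaver])   # second encoder sees interleaved bits
    # puncturing: take parity 1 on even positions, parity 2 on odd positions
    punctured = [parity1[i] if i % 2 == 0 else parity2[i] for i in range(len(bits))]
    return list(bits), punctured        # systematic stream, punctured parity stream

L = 8
rng = random.Random(0)
msg = [rng.randint(0, 1) for _ in range(L)]
pi = random.Random(1).sample(range(L), L)                  # a random interleaver permutation
print(turbo_encode(msg, pi))
```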

2.6 Trellis Termination

Before getting into the decoding process, it is important to mention the trellis termination of the convolutional encoders, since it affects the BER performance of the code. The trellis termination is basically the final state the memory elements of the convolutional encoders adopt when the end of the frame being encoded is reached. Since there is an interleaver between both convolutional encoders, terminating both of them is not a trivial task [16]. We will choose, for the purpose of this work, to terminate the first encoder and leave the second encoder open. Figure 2.9 shows the resulting Turbo encoder. The system works as follows: at the beginning, switch s1 is closed and switch s2 is open. A data frame of size $L - 2$ is encoded; then switch s1 is opened and s2 is closed, and the remaining two bits are encoded, which leads the first convolutional encoder to state 0. Note that the data frame, in this case $L - 2$ bits long, and the remaining two bits are used to terminate the trellis.

Figure 2.9: Turbo encoder with trellis termination in one encoder. P_fb = [111], P_g = [101].


Chapter 3

Decoding Turbo Codes: Soft Output Viterbi Algorithm


In this chapter we will introduce a general scheme of a Turbo decoder for a parallel concatenated code. We will go step by step through the entire decoding process and describe in depth one of the algorithms used in the SISO (Soft Input Soft Output) unit: the SOVA algorithm.

3.1 Turbo Codes decoding process.

In the previous chapter we presented Turbo Codes and the encoding process. Now it is time to talk about the decoding process. Turbo Codes are asymmetrical codes; that is, while the encoding process is relatively easy and straightforward, the decoding process is complex and time consuming. The power of Turbo Codes resides in the decoding process which, unlike other techniques, is done iteratively. Figure 3.1 shows a general scheme of a turbo decoder. As we can see, the decoding process is done by two SISO decoders. Signals arriving at the receiver are sampled and processed with the aid of the channel reliability before becoming the soft information (parity info 1, parity info 2 and systematic info) shown in figure 3.1.

Figure 3.1: Turbo Decoder generic scheme.


We can see the output of one SISO decoder becoming the input of the other decoder and vice versa, forming a feedback loop. The name turbo code is due to this feedback loop and its resemblance to a turbine engine. The final decoding is achieved by an iterative process. Soft input information is processed and, as a result, soft output information is obtained. The second decoder takes this soft information as input and produces new soft output information that the first decoder will use as input. This process continues until the system makes a hard decision. The BER obtained improves drastically over the first iterations until it begins to converge asymptotically [3]. A trade-off exists between the decoding delay and the bit error rate achieved. Even though eight iterations are enough to obtain a reasonable BER, decoders do not always run them all; instead they check the parity of the message header and then decide whether to keep iterating or not.

Note that between the decoders there is an interleaver or deinterleaver, depending on the data flow. As we mentioned in chapter 2, the interleaver/deinterleaver unit is a big issue in turbo coding. This unit reorders the soft information so that a priori data, parity data and systematic data are all time coherent at the moment of processing. Figure 3.1 also shows how the soft input information is extracted from the output, in order to avoid the positive feedback which degrades the BER performance of the system.
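The extrinsic-information exchange of figure 3.1 can be summarized by the loop sketched below. This is only a structural illustration: siso_decode is a placeholder standing in for the SOVA (or MAP) component decoder described later, the fixed number of iterations is an arbitrary choice, and the ordering of the two component decoders does not follow the decoder-0/decoder-1 role assignment used in chapter 4.

```python
import numpy as np

def siso_decode(ys, yp, La):
    """Placeholder SISO decoder. A real implementation runs SOVA (or MAP) over
    the trellis; here it only illustrates the data flow, so the extrinsic
    information it returns is zero."""
    Lapp = ys + La                # stand-in for the a posteriori LLR
    Le = Lapp - ys - La           # extrinsic = output minus both inputs
    return Lapp, Le

def turbo_decode(ys, yp1, yp2, interleaver, iterations=8):
    ys, yp1, yp2 = (np.asarray(v, float) for v in (ys, yp1, yp2))
    pi = np.asarray(interleaver)
    inv = np.argsort(pi)          # deinterleaving is the inverse permutation
    Le21 = np.zeros_like(ys)      # extrinsic information from decoder 2 to decoder 1
    for _ in range(iterations):
        Lapp1, Le12 = siso_decode(ys, yp1, Le21)          # decoder 1, natural order
        _, Le21_i = siso_decode(ys[pi], yp2, Le12[pi])    # decoder 2, interleaved order
        Le21 = Le21_i[inv]                                # back to natural order
    return (Lapp1 > 0).astype(int)  # hard decision from the sign of the a posteriori LLR
```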

3.2 SISO Unit: SOVA.

Even though the SOVA algorithm and the MAP algorithm are both trellis based (they take advantage of the trellis diagram to reduce computations), they differ in the final estimation they obtain. MAP performs better when working at low SNR, and both of them perform about the same at high SNR. MAP finds the most probable set of symbols in a message sequence, while SOVA finds the most probable sequence of states associated with a path within the trellis. Nevertheless, MAP is computationally much heavier than SOVA.

SOVA stands for Soft Output Viterbi Algorithm. Actually, it is a modification of the Viterbi Algorithm [7]. We will introduce the Viterbi Algorithm based on the explanation given in [16] and then we will add the soft output extension. The VA is widely used because it finds the most probable sequence within a trellis, and a trellis diagram can represent any finite state Markov process. Recalling our communication model, let $s = (s_0, s_1, \ldots, s_L)$ be the sequence we want to estimate and let $y$ be the received sequence of symbols. The VA finds:

$$\hat{s} = \arg\max_{s} P[s \mid y] \qquad (3.1)$$

where $y$ is the noisy set of symbols we have at the decoder after sampling; to be more precise, $y$ is the observation. From Bayes' theorem we have:

$$\hat{s} = \arg\max_{s} \frac{P[y \mid s]\, P[s]}{P[y]} \qquad (3.2)$$

Since $P[y]$ does not change with $s$, we can rewrite equation 3.2 as:

$$\hat{s} = \arg\max_{s} P[y \mid s]\, P[s] \qquad (3.3)$$

In order to compute equation 3.3, we could try all sequences $s$ and find the one that maximizes the expression. However, this idea is not scalable when the frame size is large. Since there is a first order Markov process involved, we can take advantage of two of its properties to simplify the search for $\hat{s}$. These properties are:

$$P[s_{i+1} \mid s_0 \ldots s_i] = P[s_{i+1} \mid s_i] \qquad (3.4)$$
$$P[y_i \mid s] = P[y_i \mid s_i \rightarrow s_{i+1}] \qquad (3.5)$$

Equation 3.4 establishes that the probability of the next state does not depend on the entire past sequence; it only depends on the last state. Equation 3.5 states that the conditional probability of the observation symbol $y_i$ through white noise is only relevant during the state transition. Using these properties we can work on 3.3:

$$P[y \mid s] = \prod_{i=0}^{L-1} P[y_i \mid s_i \rightarrow s_{i+1}], \qquad P[s] = \prod_{i=0}^{L-1} P[s_{i+1} \mid s_i],$$

$$\hat{s} = \arg\max_{s} \prod_{i=0}^{L-1} P[y_i \mid s_i \rightarrow s_{i+1}] \prod_{i=0}^{L-1} P[s_{i+1} \mid s_i] \qquad (3.6)$$

A hardware implementation of an adder requires fewer resources than a hardware implementation of a multiplier. So if we apply the natural logarithm to 3.6, we can replace multiplications with additions without altering the final result. This yields:

$$\hat{s} = \arg\max_{s} \sum_{i=0}^{L-1} \big( \ln P[y_i \mid s_i \rightarrow s_{i+1}] + \ln P[s_{i+1} \mid s_i] \big) \qquad (3.7)$$

Introducing $\gamma(s_i \rightarrow s_{i+1}) = \ln P[y_i \mid s_i \rightarrow s_{i+1}] + \ln P[s_{i+1} \mid s_i]$, we can rewrite equation 3.7 as:

$$\hat{s} = \arg\max_{s} \sum_{i=0}^{L-1} \gamma(s_i \rightarrow s_{i+1}) \qquad (3.8)$$

$\gamma(s_i \rightarrow s_{i+1})$ is known as the branch metric associated with transition $s_i \rightarrow s_{i+1}$. The observation $y_i$ during state transition $s_i \rightarrow s_{i+1}$ is actually the output of the encoder observed through white noise during the state transition. For our BPSK model this observation is related to the systematic and parity bit pair (Figure 3.2).

Figure 3.2: Output during state transition for a given trellis.

Thus, assuming noise independence, we can express the conditional probability of $y_i$ during the state transition as follows:

$$P[y_i \mid s_i \rightarrow s_{i+1}] = P[y_{si} \mid u_{si}] \cdot P[y_{pi} \mid u_{pi}]$$

where $u_{si}$ and $u_{pi}$ are the systematic and parity bits respectively after BPSK modulation, and

$$P[y_{si} \mid u_{si}] = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(y_{si} - u_{si})^2}{2\sigma^2}\right) dy_{si}, \qquad P[y_{pi} \mid u_{pi}] = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(y_{pi} - u_{pi})^2}{2\sigma^2}\right) dy_{pi}, \qquad (3.9)$$

since we are dealing with white Gaussian noise with variance $\sigma^2$. In addition, it is more convenient to express $P[s_{i+1} \mid s_i]$ in terms of the message bit $m_i$, since state transitions are due to this bit. Then,

$$P[s_{i+1} \mid s_i] = P[m_i] \qquad (3.10)$$

This is our a priori probability. For turbo decoding it is easier to work with log-likelihood ratios, then:

$$L_{ai} = \ln\frac{P[m_i = 1]}{P[m_i = 0]} \;\Rightarrow\; P[m_i] = \begin{cases} \frac{e^{L_{ai}}}{1 + e^{L_{ai}}} & m_i = 1 \\ \frac{1}{1 + e^{L_{ai}}} & m_i = 0 \end{cases} \;\Rightarrow\; \ln P[m_i] = L_{ai}\,m_i - \ln\!\left(1 + e^{L_{ai}}\right)$$

It is important to remark that for the first iteration all message bits are assumed to be equally likely, so $P[m_i = 1] = P[m_i = 0] = 0.5$ and $L_{ai} = 0$. For successive iterations, $L_{ai}$ is the extrinsic information provided by the other decoder through the interleaver. Replacing equation 3.9 and the above expression in the branch metric equation, we have:

$$\begin{aligned}
\gamma(s_i \rightarrow s_{i+1}) &= \ln\frac{dy_{si}\,dy_{pi}}{2\pi\sigma^2} - \frac{(y_{si} - u_{si})^2}{2\sigma^2} - \frac{(y_{pi} - u_{pi})^2}{2\sigma^2} + L_{ai}\,m_i - \ln\!\left(1 + e^{L_{ai}}\right) \\
&= -\frac{1}{2\sigma^2}\left[(y_{si} - u_{si})^2 + (y_{pi} - u_{pi})^2\right] + L_{ai}\,m_i \\
&= -\frac{1}{2\sigma^2}\left[y_{si}^2 - 2y_{si}u_{si} + u_{si}^2 + y_{pi}^2 - 2y_{pi}u_{pi} + u_{pi}^2\right] + L_{ai}\,m_i \\
&= \frac{1}{\sigma^2}\left[y_{si}u_{si} + y_{pi}u_{pi}\right] + L_{ai}\,m_i
\end{aligned}$$

Note that in order to simplify the equations we have neglected terms that do not change when varying the sequence $s$. From chapter 2 we know that $\sigma^2 = \frac{N_0}{2E_s}$ and $E_s = rE_b$, where $r = \frac{1}{2}$ is the code rate. So finally we obtain:

$$\gamma(s_i \rightarrow s_{i+1}) = \frac{E_b}{N_0}\left[y_{si}u_{si} + y_{pi}u_{pi}\right] + L_{ai}\,m_i \qquad (3.11)$$

It is more common to express equation 3.11 as shown below, using the channel reliability $L_c = 4a\frac{E_s}{N_0}$ (with $a = 1$ for our model):

$$\gamma(s_i \rightarrow s_{i+1}) = L_c\,y_{si}x_{si} + L_c\,y_{pi}x_{pi} + L_{ai}\,m_i \qquad (3.12)$$

Then 3.8 becomes:

$$\hat{s} = \arg\max_{s} \sum_{i=0}^{L-1} \left( L_c\,y_{si}x_{si} + L_c\,y_{pi}x_{pi} + L_{ai}\,m_i \right) \qquad (3.13)$$

where $x_{si}$, $x_{pi}$ are the raw bits at the output of the channel encoder before BPSK modulation. Also, $m_i = x_{si}$ for our RSC encoder. It is important to remark that, according to [11], for the SOVA algorithm $L_c$ can be assumed to be equal to 1. This means that there is no need to estimate the SNR of the channel. This is possible because at the beginning of the decoding process, at the first iteration, $L_{ai} = 0$, which leads the resulting extrinsic information to be weighted by $L_c$. This extrinsic information becomes $L_{ai}$ for the next SISO decoder, making all terms in equation 3.13 be weighted by $L_c$. Therefore $L_c$ has no influence on the decoding process. The fact that SOVA does not need the channel estimation avoids a lot of difficulties and represents a big advantage over the MAP algorithm. Summarizing, table 3.1 shows the relevant equations for applying the SOVA algorithm.


Branch Metric: $\gamma(s_i \rightarrow s_{i+1}) = y_{si}x_{si} + y_{pi}x_{pi} + L_{ai}\,m_i$  (3.14)

Sequence Estimator: $\hat{s} = \arg\max_{s} \sum_{i=0}^{L-1} \left( y_{si}x_{si} + y_{pi}x_{pi} + L_{ai}\,m_i \right)$  (3.15)

where $\{x_{si}, x_{pi}\}$ is the encoder output symbol when the input message bit is $m_i$; $\{y_{si}, y_{pi}\}$ is the received symbol when the encoder output symbol is BPSK-modulated and transmitted through an AWGN channel. Finally, $L_{ai}$ represents the LLR of the message bit $m_i$.

Table 3.1: Equations summary.

In the next subsection we will develop an example in order to show how expression 3.15 and the trellis diagram are applied in the decoding process.

3.2.1 Viterbi Algorithm Decoding Example

Figure 3.3 shows a trellis diagram example for a code with $P_{fb} = [111]$, $P_g = [101]$, and tries to clarify the decoding process.

1. As shown in figure 3.3.a, the process begins at time $i = 0$ from state 0, because that is the state the encoder takes when initialized. Thus, the probability of being at state 0 is one and the probability of being at any other state is zero. We assign these probabilities, as path metrics in the log domain, to each state:

$$pm_{0,0} = 0, \qquad pm_{0,k} = -\infty \quad \forall k \neq 0$$

2. Then, the branch metrics are computed at each state for message bits 0 and 1 and the corresponding parity bit, as listed below.

Figure 3.3: Trellis diagram for the VA, code given by $P_{fb} = [111]$, $P_g = [101]$. (a) Computing branch metrics. (b) Surviving branches. (c) Continuing at i = 1. (d) Tracing back from the last state.

$$\begin{aligned}
\gamma(s_{0,0} \rightarrow s_{1,0}) &= (y_{si} + L_{ai})\cdot 0 + y_{pi}\cdot 0 & \gamma(s_{0,0} \rightarrow s_{1,2}) &= (y_{si} + L_{ai})\cdot 1 + y_{pi}\cdot 1 \\
\gamma(s_{0,1} \rightarrow s_{1,2}) &= (y_{si} + L_{ai})\cdot 0 + y_{pi}\cdot 0 & \gamma(s_{0,1} \rightarrow s_{1,0}) &= (y_{si} + L_{ai})\cdot 1 + y_{pi}\cdot 1 \\
\gamma(s_{0,2} \rightarrow s_{1,3}) &= (y_{si} + L_{ai})\cdot 0 + y_{pi}\cdot 1 & \gamma(s_{0,2} \rightarrow s_{1,1}) &= (y_{si} + L_{ai})\cdot 1 + y_{pi}\cdot 0 \\
\gamma(s_{0,3} \rightarrow s_{1,1}) &= (y_{si} + L_{ai})\cdot 0 + y_{pi}\cdot 1 & \gamma(s_{0,3} \rightarrow s_{1,3}) &= (y_{si} + L_{ai})\cdot 1 + y_{pi}\cdot 0
\end{aligned}$$

3. The incoming path metrics for each state at time $i = 1$ are calculated by adding the incoming branch metrics to the corresponding path metrics of the states at time $i = 0$ (figure 3.3.b). For each state at time $i = 1$, the incoming branch with the greater incoming path metric is kept. The new path metrics of these states are the surviving incoming path metrics:

$$\begin{aligned}
pm_{1,0} &= \max\big(pm_{0,0} + \gamma(s_{0,0} \rightarrow s_{1,0}),\; pm_{0,1} + \gamma(s_{0,1} \rightarrow s_{1,0})\big) \\
pm_{1,1} &= \max\big(pm_{0,3} + \gamma(s_{0,3} \rightarrow s_{1,1}),\; pm_{0,2} + \gamma(s_{0,2} \rightarrow s_{1,1})\big) \\
pm_{1,2} &= \max\big(pm_{0,1} + \gamma(s_{0,1} \rightarrow s_{1,2}),\; pm_{0,0} + \gamma(s_{0,0} \rightarrow s_{1,2})\big) \\
pm_{1,3} &= \max\big(pm_{0,2} + \gamma(s_{0,2} \rightarrow s_{1,3}),\; pm_{0,3} + \gamma(s_{0,3} \rightarrow s_{1,3})\big)
\end{aligned}$$

In figures 3.3.b and 3.3.c the surviving branches are drawn thicker.

4. This algorithm is repeated from item 2 until time $i = L - 1$. Note that the final states will be at $i = L$.

5. In order to find $\hat{s}$ at this point, there are two possibilities: if the encoder was terminated, the system should trace back from the state at which the encoder was terminated (usually state 0) through all linked surviving branches. If the encoder was not terminated, the system should choose the state with the greatest path metric and trace back from there. Each branch within the trellis has an associated message bit $m_i$; the set of those bits is the most probable message. This step is shown in figure 3.3.d, where the survival path is colored in green.
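The decoding steps above can be condensed into a short software model. The sketch below builds the trellis from the polynomials and performs the add-compare-select recursion followed by the trace back from state 0. It is illustrative only: it assumes $L_c = 1$ and a terminated trellis, and the internal state numbering may differ from the labels of figure 3.3.

```python
import numpy as np

def rsc_trellis(p_fb=(1, 1, 1), p_g=(1, 0, 1)):
    """Enumerate next state and output (systematic, parity) for every (state, input)."""
    K = len(p_fb)
    n_states = 2 ** (K - 1)
    nxt, out = {}, {}
    for s in range(n_states):
        regs = [(s >> i) & 1 for i in range(K - 1)]
        for m in (0, 1):
            a = m
            for c, r in zip(p_fb[1:], regs):
                a ^= c & r                              # feedback bit
            p = p_g[0] & a
            for c, r in zip(p_g[1:], regs):
                p ^= c & r                              # parity bit
            new_regs = [a] + regs[:-1]
            nxt[(s, m)] = sum(b << i for i, b in enumerate(new_regs))
            out[(s, m)] = (m, p)
    return n_states, nxt, out

def viterbi_decode(ys, yp, La=None, p_fb=(1, 1, 1), p_g=(1, 0, 1)):
    """Hard-output Viterbi decoding of a rate-1/2 RSC code, using the branch
    metric of equation 3.14 (Lc = 1). Assumes a frame terminated in state 0."""
    L = len(ys)
    La = np.zeros(L) if La is None else np.asarray(La, float)
    n_states, nxt, out = rsc_trellis(p_fb, p_g)
    NEG = -1e9
    pm = np.full(n_states, NEG)
    pm[0] = 0.0
    prev = np.zeros((L, n_states), dtype=int)   # predecessor state of the survivor
    bit = np.zeros((L, n_states), dtype=int)    # message bit on the surviving branch
    for i in range(L):
        new_pm = np.full(n_states, NEG)
        for s in range(n_states):
            for m in (0, 1):
                xs, xp = out[(s, m)]
                g = ys[i] * xs + yp[i] * xp + La[i] * m   # branch metric (3.14)
                k = nxt[(s, m)]
                if pm[s] + g > new_pm[k]:
                    new_pm[k] = pm[s] + g
                    prev[i, k], bit[i, k] = s, m
        pm = new_pm
    state, decoded = 0, []                      # trace back from state 0
    for i in reversed(range(L)):
        decoded.append(bit[i, state])
        state = prev[i, state]
    return decoded[::-1]
```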

3.2.2 Soft Output extension for the VA.

The Viterbi Algorithm is able to find the most probable sequence within the trellis and hence its associated bits. Turbo coding techniques also demand that the SISO unit supply soft output information. There are two well-known extensions of the Viterbi Algorithm that produce soft output [11]. One was proposed by Battail [2] and is known as BR-SOVA. The other one was proposed by Hagenauer [7] and is known as HR-SOVA. The latter is used more often than the former, even though BR-SOVA performs better in terms of BER; however, HR-SOVA allows an easier hardware implementation. We will explain the HR-SOVA extension and remark on the main idea.


Soft output information represents a measure of the bit reliabilities. As a starting point for the algorithm, a reliability of infinity is assumed for every bit in the frame, thus $L_i = \infty \;\forall i$. The remaining steps proceed as follows:

1. As shown in the example of figure 3.4.a, at time $i = L$ and state $k = 0$ the trace back of the survival path starts. The survival path has been colored in green, as exhibited in the legend of the figure.

2. In order to find the bit reliabilities, the competing path also needs to be traced back, from time $i = L$ and state $k = 0$ to the time it merges with the survival path. This competing path has been colored in orange and, for the example of figure 3.4.a, the time where both paths merge is $i_m = L - 4$. Also, the difference between both incoming path metrics at time $i$ and state $k$ has to be found. In figure 3.4 this value is represented as:

$$\Delta_{i,k} = \left( pm_{i-1,k'} + \gamma(s_{i-1,k'} \rightarrow s_{i,k}) \right) - \left( pm_{i-1,k''} + \gamma(s_{i-1,k''} \rightarrow s_{i,k}) \right) \qquad (3.16)$$

where $k'$ and $k''$ are the predecessor states of $k$ for a message bit $m_i \in \{0, 1\}$ respectively. See figure 3.4.a for references.

3. Let $j$ be a new time index in the range $i_m < j \le i$. At every time instant $j$, the system compares the message bit of the survival path with the message bit of the competing path. If they differ, then the reliability $L_j$ has to be updated according to

$$L_j \leftarrow \min\left(L_j, \Delta_{i,k}\right) \qquad (3.17)$$

In figure 3.4 a red square is placed on the branches that differ in the message bit. The BR-SOVA also has an updating rule for the case where the message bit of the survival path does not differ from the message bit of the competing path:

$$L_j \leftarrow \min\left(L_j, \Delta_{i,k} + L_j^c\right) \qquad (3.18)$$

This is the main difference between HR-SOVA and BR-SOVA. Nevertheless, this updating rule implies knowledge of the bit reliabilities of the competing paths, $L_j^c$ [11].

4. Once the system reaches the state where the survival path and the competing path merge, it moves one time instant back, from $i$ to $i - 1$, through the survival path and traces back once again the competing path at that state. This process is shown in figure 3.4.b. For the example, the system now starts at time $i = L - 1$ and the corresponding state $k = 0$. For this case, the competing path and the survival path now merge at time $i_m = L - 5$.

5. This algorithm continues from step 2 until time $i = 1$, thus allowing all the bit reliabilities to be updated. Figure 3.4.c shows one more iteration with the aim of clarifying the process.

Finally, the soft output information is obtained in terms of the LLR (Log-Likelihood Ratio) as follows:

$$\Lambda_i = (2\hat{m}_i - 1)\,L_i, \qquad 0 \le i \le L-1 \qquad (3.19)$$


Figure 3.4: Soft output extension example for the Viterbi Algorithm. Code given by $P_{fb} = [111]$, $P_g = [101]$. (a) Survival path and competing path at time i = L, state k = 0. (b) Survival path and competing path at time i = L-1, state k = 0. (c) Survival path and competing path at time i = L-2, state k = 1.

where $\hat{m}_i$ is the estimated message bit, $\hat{m}_i \in \{0, 1\}$. Note that $(2\hat{m}_i - 1)$ only gives the sign of $\Lambda_i$; its magnitude is provided by $L_i$. After explaining the previous algorithm, it is important to remark the main idea of the process. At a given time $0 \le i \le L-1$, the question to ask is: how reliable is the message bit $m_i$? The extension for soft output indicates that the correctness of bit $\hat{m}_i$ can only be as good as the decision to choose the closest competing path over the most likely path.
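The update rules 3.17 and 3.18 can be expressed compactly as shown below. This is a simplified illustration that assumes the survival-path and competing-path bit sequences, the metric difference of equation 3.16 and the merge time have already been obtained from the trace back; the competing-path reliabilities required by the BR-SOVA rule are passed in explicitly.

```python
def sova_update(reliab, surv_bits, comp_bits, delta, i, i_merge,
                br_rule=False, comp_reliab=None):
    """One reliability-update step of SOVA at decision time i.
    reliab[j] holds L_j (initialised to +inf); surv_bits/comp_bits are the
    message bits of the survival and competing paths; delta is the path metric
    difference of equation 3.16; i_merge is the time where both paths merge.
    br_rule enables the BR-SOVA rule of equation 3.18, which also needs the
    competing-path reliabilities comp_reliab."""
    for j in range(i_merge + 1, i + 1):
        if surv_bits[j] != comp_bits[j]:
            reliab[j] = min(reliab[j], delta)                    # HR-SOVA rule (3.17)
        elif br_rule and comp_reliab is not None:
            reliab[j] = min(reliab[j], delta + comp_reliab[j])   # BR-SOVA rule (3.18)
    return reliab

# example: update over the window where the two paths differ
L_rel = [float("inf")] * 6
print(sova_update(L_rel, [1, 0, 1, 1, 0, 1], [1, 0, 0, 1, 1, 1],
                  delta=1.7, i=5, i_merge=1))
```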

3.2.3 Improving the soft output information of the SOVA algorithm.

The soft output generated by the HR-SOVA turns out to be overoptimistic [12]. This means that the HR-SOVA algorithm produces an LLR that is greater in magnitude than the LLR produced by the BR-SOVA or by the MAP algorithm. These overoptimistic values of the LLR lead HR-SOVA to a worse performance in terms of BER. In [12] two problems associated with the output of the HR-SOVA are described. One is due to the correlation between extrinsic and intrinsic information when the HR-SOVA is used in a turbo code scheme. The other problem is due to the fact that the output of the HR-SOVA is biased. The first problem is not easy to solve, and most hardware implementations do not deal with it. In contrast, for the second problem there have been several proposals based on a normalization method. The idea behind a normalization method can be shown by assuming that the output of the HR-SOVA, given a message bit $m_i$, is a random variable with a Gaussian distribution, then:

$$P[\Lambda_i \mid m_i = 1] = \frac{1}{\sqrt{2\pi}\,\sigma_\Lambda} \exp\!\left(-\frac{(\Lambda_i - \mu)^2}{2\sigma_\Lambda^2}\right) d\Lambda_i, \qquad (3.20)$$
$$P[\Lambda_i \mid m_i = 0] = \frac{1}{\sqrt{2\pi}\,\sigma_\Lambda} \exp\!\left(-\frac{(\Lambda_i + \mu)^2}{2\sigma_\Lambda^2}\right) d\Lambda_i, \qquad (3.21)$$

where $\mu$ is the expectation of $\Lambda_i$ and $\sigma_\Lambda = \sqrt{E[\Lambda_i^2] - \mu^2}$ is the standard deviation. In order to find the LLR of the message bit $m_i$ given the output of the HR-SOVA, we can define:

$$\Lambda'_i = \ln\frac{P[m_i = 1 \mid \Lambda_i]}{P[m_i = 0 \mid \Lambda_i]}, \qquad (3.22)$$

Using Bayes' theorem, assuming $P[m_i = 1] = P[m_i = 0]$, and working on the previous expression with 3.20 and 3.21, yields:

$$\Lambda'_i = \frac{2\mu}{\sigma_\Lambda^2}\,\Lambda_i, \qquad (3.23)$$

which indicates that the HR-SOVA output should be multiplied by the factor $c = \frac{2\mu}{\sigma_\Lambda^2}$ to obtain the LLR.

According to [12], the factor $c$ depends on the BER at the decoder output. Some schemes try to estimate the factor $c$, while others set a fixed value for it. In our hardware implementation we will use a fixed scaling factor since, as reported in [10], the BER performance with a fixed scaling factor is better than with a variable scaling factor.
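To see where the factor $c = 2\mu/\sigma_\Lambda^2$ comes from, the sketch below estimates $\mu$ and $\sigma_\Lambda$ from a batch of synthetic decoder outputs with known transmitted bits. It is purely illustrative: the Gaussian samples are generated only to exercise equation 3.23, whereas the actual design fixes $c$ offline as discussed above.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true, sigma_true, n = 3.0, 2.0, 100000

# synthetic HR-SOVA outputs: Gaussian around +mu for m=1 and -mu for m=0 (eqs. 3.20, 3.21)
bits = rng.integers(0, 2, n)
lam = rng.normal(mu_true * (2 * bits - 1), sigma_true)

# estimate mu and sigma from the samples conditioned on m=1, then form c = 2*mu/sigma^2
lam1 = lam[bits == 1]
mu_hat = lam1.mean()
sigma_hat = lam1.std()
c = 2 * mu_hat / sigma_hat ** 2
print(c)   # close to 2*3/4 = 1.5 for these parameters

# the corrected LLR of equation 3.23 is then c * lam
```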


Chapter 4

Hardware Implementation of a Turbo Decoder based on SOVA


In the previous chapter we introduced the general ideas of a turbo decoder and presented the HR-SOVA algorithm (from now on we will refer to it just as SOVA) as the active part of the SISO unit. In this chapter we will deal with the implementation issues and analyze today's most commonly used hardware architectures. Next, we will introduce a new algorithm for finding points of the survival path and, consequently, we will present the architecture for implementing it. We will describe the unit that updates the bit reliabilities and finally we will present the improvements which allow the decoder to boost its BER performance.

As a general scheme we present figure 4.1. There are two blocks of RAM used as input and output buffers. There are also two more blocks of RAM used to store temporary data, namely the a priori and extrinsic information. Then there is a unit that deals with the interleaving process, a unit to control the system and to interact with the user, and finally the SISO unit that implements the SOVA algorithm. Note that we only use one SISO unit. This is possible because the interleaver/deinterleaver does not allow concurrent processing, so a frame has to be completed by one decoder before it can be processed by the other. For the proposed architecture, this processing is always done by the same decoder.

Data arriving at the receiver is processed and fed into the data-in RAM buffer; then a start command is delivered to the control unit. The states the system goes through are shown in figure 4.2. The system starts by processing the interleaved data first and, at the last iteration, it ends up with the deinterleaved data. This is done in order to save an access through the interleaver at the end of the decoding process, which also saves power and allows a simpler control unit. However, the system has to wait until the entire frame is received before decoding can take place. Even though the same unit is used as decoder 1 and decoder 0, its behavior changes slightly depending on the role the unit is playing. We can summarize the following tasks for each role:


When the SOVA unit is acting as decoder 1:

- When SOVA addresses the data-in RAM buffer, its addresses belong to the interleaved domain.
- Since its addresses belong to the interleaved domain, in order to get systematic data it has to go through the deinterleaver. It can address parity data 2 directly.
- If the first iteration is running, then the a priori information is assumed to be 0. Otherwise, it fetches the a priori information through the deinterleaver from RAM La/Le.
- It writes extrinsic information directly to RAM Le/La. This entails that, when acting as decoder 0, it has to access the a priori information through the interleaver.

When the SOVA unit is acting as decoder 0:

- Its addresses belong to the deinterleaved domain, that is, the domain where the information bits are in order.
- It can access systematic data and parity data 1 directly from the data-in RAM buffer.
- The a priori information is accessed through the interleaver, since each word was written to an address in RAM Le/La that belongs to the interleaved domain.
- It writes extrinsic information directly to RAM La/Le.
- It writes the hard output directly to the data-out RAM buffer. This can be done at each iteration, allowing the user to check for a frame header, or only when running the last iteration, with the aim of saving power.


Figure 4.1: Hardware implementation of a turbo decoder


Figure 4.2: Overall system states diagram

4.1 Turbo Decoder RAM buffers.

All the RAM buffers are based on double port RAMs. Figure 4.3 shows the scheme of the data-in RAM. Since the systematic data and the parity data 2 belong to different time domains, two double port RAMs are used to store this information. In figure 4.4 the scheme of the data-out RAM is shown. Finally, figure 4.5 presents the RAM La/Le and the RAM Le/La, which are equivalent.


Figure 4.3: Data-in RAM


Figure 4.4: Data-out RAM

Figure 4.5: RAM La/Le and RAM Le/La connections

4.2 Interleaving/Deinterleaving unit of the turbo decoder

There have been several proposals to design an area efficient interleaver. In [14], contention-free interleavers that allow concurrent processing are studied. In our case, for the sake of simplicity and versatility, a ROM is used to carry out the interleaving/deinterleaving functions as look-up tables. Figure 4.6 shows the interleaving/deinterleaving unit. The figure also shows some control signals. The signal named deco indicates the role the SOVA unit is playing.

Figure 4.6: Interleaving/Deinterleaving Unit.

Note that when working with deco=1, the address of the parity data 2 is delayed one cycle, while the address of the systematic data goes through the deinterleaver. Also, the a priori data is fetched from RAM La/Le and the extrinsic information is written directly to RAM Le/La. In contrast, when working with deco=0, there is no need to access the parity data 2 RAM, since the parity data 1 and the systematic data are stored in the same RAM position. In this case, the a priori information is accessed through the interleaver and the extrinsic information is written directly to RAM La/Le.
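Functionally, the ROM-based unit implements a permutation look-up and its inverse, as in the short sketch below (behavioural model only; the random permutation is a placeholder, not the interleaver actually stored in the ROM).

```python
import random

L = 16
rng = random.Random(0)
interleave_rom = rng.sample(range(L), L)        # address permutation stored in the ROM
deinterleave_rom = [0] * L
for addr, target in enumerate(interleave_rom):  # the deinterleaver ROM is the inverse
    deinterleave_rom[target] = addr

def interleave(addr):
    return interleave_rom[addr]

def deinterleave(addr):
    return deinterleave_rom[addr]

assert all(deinterleave(interleave(a)) == a for a in range(L))
```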

4.3 SOVA as the core of the SISO.

Before getting into our hardware implementation of the SOVA algorithm, it is important to comment on some of the most commonly used hardware architectures today. Since the SOVA algorithm is an extension of the Viterbi Algorithm, most of the main units have been based on the implementations achieved for the Viterbi Algorithm; these architectures are complemented with reliability updating units to produce the soft output. Figure 4.7 shows a comparison between Viterbi decoders and SOVA decoders. Both decoders have a BMU (Branch Metric Unit), an ACSU (Add Compare Select Unit) and an SMU (Survival Memory Unit). However, the SOVA ACSU has to provide the difference between path metrics, and the SMU includes an RUU (Reliability Updating Unit) that produces the soft output information. In the next sections we discuss the issues related to the SOVA components.


Figure 4.7: Viterbi and SOVA decoder schemes

4.4 Branch Metric Unit.

As its name suggests, this unit calculates the branch metrics. According to equation 3.14, the possible branch metrics depend on the bits x_si, x_pi and m_i. When working with an RSC encoder of rate 1/2, x_si = m_i and there is only one parity bit x_pi, which means that there are four possible branch metrics at each time instant i:

(x_si, x_pi) = (0, 0):  γ_0 = 0
(x_si, x_pi) = (0, 1):  γ_1 = y_pi
(x_si, x_pi) = (1, 0):  γ_2 = y_si + La_i
(x_si, x_pi) = (1, 1):  γ_3 = y_si + La_i + y_pi

The BMU for an RSC encoder of rate 1/2 is shown in figure 4.8.
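As a minimal numeric illustration of these four metrics, the following Python sketch (the function name is ours) evaluates them for one received symbol pair and one a priori value:

def branch_metrics(ys, yp, la):
    # Branch metrics of a rate-1/2 RSC code, one per (x_s, x_p) hypothesis.
    return {
        (0, 0): 0.0,            # gamma_0
        (0, 1): yp,             # gamma_1
        (1, 0): ys + la,        # gamma_2
        (1, 1): ys + la + yp,   # gamma_3
    }

# Example with arbitrary soft inputs: y_s = 1.5, y_p = -0.5, La = 0.25
print(branch_metrics(1.5, -0.5, 0.25))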


Figure 4.8: BMU for the RSC encoder.

4.5 Add Compare Select Unit.

Applying equation 3.15 to the trellis diagram yields the following expression:

pm_{i,k} = pm_{i-1,k'} + γ(s_{i-1,k'}, s_{i,k})

where k' is the state at time i-1 whose transition into k produces the higher incoming path metric. The previous expression suggests that the path metric pm_{i,k} can be obtained by recursion. In figure 4.9 an ACSU for the SOVA unit is presented. The set of registers holds the previous path metrics. The branch metrics are mapped to the corresponding adders, according to the outputs of the state transitions, to produce the incoming path metrics. These incoming path metrics are connected to the selectors, which choose the higher incoming path metric and produce the decision vector along with the difference between the incoming path metrics. The connections between adders and selectors represent the trellis butterfly.

One problem that might arise is the overflow of the path metrics after a certain amount of time. Since the relevant information is the difference between path metrics, a normalization method can be adopted. Many normalization methods have been proposed since the introduction of Viterbi decoders; we find the modulo technique reported in [13] to be a good solution, since it actually allows the overflow. The idea behind the modulo technique is that the maximum difference B between the path metrics of all states is bounded. Figure 4.10 shows the mapping of all the numbers representable by a path metric register of nb bits onto a circumference. Let ipm_{i,k} and ipm'_{i,k} be two incoming path metrics at a given time i and state k; then it is shown in [13] that ipm_{i,k} > ipm'_{i,k} if ipm_{i,k} - ipm'_{i,k} > 0 in a two's complement representation. The number of bits nb relates to the bound as follows:

C = 2^nb >= 2B

Figure 4.9: Add Compare Select Unit for the SOVA. Pfb = [111], Pg = [101]

This means that, even though the path metrics may grow in different ways, they all remain within one half of the representation space provided by C. An appropriate bound is B = 2·n·B_γ, where n is the minimum number of stages that ensures complete trellis connectivity among all trellis states and B_γ is the upper bound on the branch metrics [13].
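The modulo technique can be illustrated with a small software model: the path metric accumulators live in nb-bit registers that are allowed to wrap around, and the comparison is done on the two's complement interpretation of their difference. The sketch below (Python, our own naming; it only models the comparison, not the full ACSU) shows why the comparison stays correct after overflow as long as the differences remain bounded:

def to_signed(x, nb):
    # Interpret the lowest nb bits of x as a two's complement number.
    x &= (1 << nb) - 1
    return x - (1 << nb) if x >= (1 << (nb - 1)) else x

def greater_modulo(pm_a, pm_b, nb):
    # pm_a > pm_b under the modulo representation of [13]; valid while the true
    # difference between any two path metrics stays below half the range 2**nb.
    return to_signed(pm_a - pm_b, nb) > 0

# Two 8-bit accumulators whose true values are 300 and 290: both have wrapped
# around, yet the comparison still identifies the larger one.
nb = 8
print(greater_modulo(300 % 2**nb, 290 % 2**nb, nb))   # True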

Figure 4.10: Modular representation of the path metrics. Each path metric register has a width of nb bits.

Figure 4.11: Merging of paths in the traceback.

4.6 Survival Memory Unit.

The remaining SOVA units must obtain the soft output information for every bit in the frame, along with the most likely path. One way to do so is to store all the data the ACSU provides; then, when the last time instant is reached, the data is traced back and the bit reliabilities are updated according to the SOVA algorithm. However, most hardware architectures do not do it that way, because the latency is high and the amount of memory grows considerably with the frame size, the number of states of the encoder and the quantization width of Δ_{i,k}. Most SMUs take advantage of a trellis property to solve this problem. This property is illustrated in figure 4.11, where a trellis diagram from a decoding process is shown. If all the paths are traced back from all the states at a given time i, it is found that they merge at time instant i_FP. Therefore, from time instant i_FP down to i = 1, the only path remaining in the trace back started at time i is the survival path. We define the time instant, along with the state, where the paths merge as a FP (Fusion Point). Then, looking at the example of figure 4.11, for time instant i there is a FP at (i_FP, s3). Simulations have shown that the distance between the time instant i and the FP i_FP is a random variable. It is also observed that the probability of the paths merging increases with the depth of the trace back and is proportional to the constraint length of the code; a trace back depth of 10 times the constraint length of the code might therefore allow the paths to merge with high probability. Below we describe the most used architectures based on this property.

Figure 4.12: Register Exchange SMU for the SOVA. Pfb = [111], Pg = [101]

4.6.1 Register Exchange Survival Memory Unit.

The RE (Register Exchange) SMU for an RSC encoder of rate 1/2 is shown in figure 4.12. This scheme is reported in [9]. It is an array of PEs (Processing Elements) of n rows and D columns, where n is the number of states of the encoder and D is the trace back depth. The connection topology between PEs is given by the trellis of the encoder. In figure 4.12, two types of PEs can be distinguished. The first U PEs (red outline), besides tracing back the paths, update the bit reliabilities; figure 4.13.a shows a PE with updating capability, and figure 4.13.b shows a normal PE. The system allows the trace back of all the paths from the states at time instant i. The ACSU provides the data that enters the RE from the left, and the first U units update the bit reliabilities of each path according to the SOVA algorithm. Each row of the array holds the information of one path: for example, the first row holds the information corresponding to the path traced back from state 0 at time i, the second row holds the information corresponding to the path traced back from state 1 at time i, and so on. After D clock cycles, if D is large enough to allow the paths to merge, the message bit and its reliability are obtained. Note that, if the paths merge before D, then the data coming out of the rows of all states is the same, since the tails of all the paths belong to the survival path; therefore, only the data from one row is selected.

Parameters U and D represent a trade-off. Some architectures set U in a range from two to five times the constraint length of the code, while D is set between five and ten times the constraint length of the code. If U and D are large, the BER performance improves, but so do the power consumption and the area. The area increase is also due to the resources spent in the connections, which becomes a serious problem as the number of states of the encoder grows. If U is large and D is not, then resources are spent uselessly, since the BER performance is not increased; the same happens if D is large while U is short, or when both are short. The decoding latency of this scheme is D clock cycles and, as can be observed, the pipelined style of the architecture suggests high activity and hence a relatively high dynamic power consumption.

Figure 4.13: Register Exchange processing elements. (a) PE with updating capability. (b) Normal PE, trace back only.

4.6.2 Systolic Array Survival Memory Unit.

The RE scheme presents one major problem that leads to a high power consumption: all the paths are traced back D steps. The idea behind the SA (Systolic Array) is to trace back only one path; after D steps this path will have merged with the survival path and will become the path we are looking for. The SA is presented in [15].
Figure 4.14.a introduces the scheme of the SA for an RSC encoder of rate 1/2 with four states; the figure only shows the SMU for the VA. It is composed of an array of elements arranged in n rows and 2D columns, where n is the number of states of the encoder and D is the depth of the trace back. There is also one more row, with D TB (trace back) elements; this row of TB elements holds the sequence of states belonging to the survival path. It can be observed that the connections between the elements of the array are much simpler than in the RE scheme. The system works as follows: the selection unit feeds the decision bits v_{i,k} provided by the ACSU into the left of the array. After D clock cycles, the SA is half full and the selection unit begins to feed the state s_{i,k} with the highest path metric accumulated in the ACSU registers into the leftmost TB element. The system also works if the selection unit feeds any other state; however, the state with the highest path metric is the most likely to belong to the survival path. Once the most likely state is fed, the TB elements, along with the decision vectors, trace that state back during D more cycles. Figure 4.14.b shows the details of the TB cell. Finally, after 2D cycles, the SU (Survival Unit) of figure 4.15 provides the most likely message bit. Note that for this scheme the latency is twice the latency of the RE scheme, although the trace back depth is only D; note also that this structure again suggests high activity and a relatively high dynamic power consumption.

So far the SA deals with the VA. The SOVA extension of the SA presents some major problems that were cited in [6]: SOVA requires the path metric differences of every state; the trace back must occur on two paths (survivor and competitor); and each state must have access to all the information about the path metric differences and decision vectors for that particular time. These issues make the SA not a good choice for a complete SOVA based decoder. However, the SA has been used in [17] as a reliability updating unit in a Two Step configuration.

Figure 4.14: Systolic Array for the Viterbi Algorithm. (a) Systolic Array for the Viterbi Algorithm. (b) Trace Back element of the Systolic Array.

Figure 4.15: Survival unit for the Systolic Array.

Figure 4.16: Two Step idea. First tracing back, and then reliability updating.

4.6.3 Two Step approach for the Survival Memory Unit.

This scheme was proposed in [9] with the intention of discarding all the operations that do not affect the output. The idea is to postpone the updating process until the survival path is found; figure 4.16 shows this concept. The first D steps intend to find the survival path, while the remaining U steps update the bit reliabilities. A FIFO (First In, First Out) memory is usually employed to delay the path metric differences, along with the decision vectors, until the updating process begins. The SMU we propose in this document is actually a Two Step configuration; however, we introduce a new scheme for finding the survival path.


4.6.4 Other Architectures.

Many architectures and schemes have been proposed in the last years. In [4] different SMUs for the VA are studied and compared; in [6] a trace back architecture based on an orthogonal memory is presented. However, all these schemes deal with a finite trace back depth D and with a finite updating length U, which leads to a non-optimum algorithm execution. In the next subsection we introduce a new architecture for the SOVA algorithm that does not depend on the D-U trade-off.

4.6.5 Fusion Points Survival Memory Unit.

So far, two of the most common schemes have been studied: the RE and the SA. Both of them carry out a trace back with the aid of a pipelined architecture, and the size of this pipeline has an impact on the area, the power consumption and the BER performance. One of the contributions of this work is a new type of architecture, based on a new algorithm, and the development of the architecture that implements it. The major advantage of this new scheme is that it is independent of the D-U trade-offs and it allows recursive processing, which lessens the register activity.

The new architecture we propose to implement the SOVA algorithm deals, as its name suggests, with the Fusion Points. Figure 4.17 shows the general scheme. It consists of an FPU (Fusion Point Unit), which finds the time instant and the state where the survival paths merge; it is inside this unit where the new algorithm is implemented. There is a dual port RAM to store the data the ACSU provides, and finally there is an RUU that updates the bit reliabilities based on the information provided by the FPU. The unit works as follows: the data the ACSU provides is stored in the dual port RAM, and the decision bits v_{i,k} are also used by the FPU to implement the FP search algorithm. Whenever a FP is found, it is indicated to the RUU, which updates the bit reliabilities by a trace back method aided by the data fetched through the second port of the dual port RAM.

Figure 4.17: Fusion Points based SMU

Figure 4.18: Possibility of fusion points

4.6.5.1 Fusion Points Unit

This unit finds the Fusion Points along the trellis, for a code of rate 1/2, by means of a new algorithm (we develop the algorithm for a code of rate 1/2; however, it can be extended to any code rate). The algorithm is based on the idea that a fusion point for a code of rate 1/2 will always reside at the merging point of two paths. Figure 4.18 shows these possible fusion points. The following thought explains the previous idea: whenever a trace back operation takes place, the system traces back from a given time instant i; while tracing back, paths merge, at different time instants, in groups of two. The last of these two-path merging points is a Fusion Point. Therefore, a FP will always reside at the merging point of two paths.

The following steps, along with the example of figure 4.19, introduce part of the algorithm. The decision vectors coming from the ACSU are used to identify the merging paths, or possible fusion points (figure 4.19.a). Each possible fusion point is marked; whenever a mark is set, the mark time and state are held in registers (figure 4.19.a). The mark is propagated along the branches to the next states (figure 4.19.b), and this propagation is repeated at every clock cycle. If a mark propagates to all the states at a given time, then the origin of that mark is a fusion point; the fusion point coordinate is held by the register and can be recalled immediately (figure 4.19.c). After introducing the mark movements, figure 4.20 shows a sequence example where more than one mark is handled at the same time.

Figure 4.19: Fusion Point detection algorithm. (a) Possible fusion point detection. (b) Mark propagation. (c) Mark propagation and fusion point detection.

In the figure, two columns can be appreciated. The left column indicates the time instant the system is processing, along with the system status. The status is composed of three pointers that are able to hold the time and state of FPs: the first two pointers hold the possible FPs detected, while the third pointer indicates a FP. The right column shows the sequence from time i = 0 to time i = 5.

Figure 4.20: Sequence of the Fusion Point algorithm

The algorithm proceeds as follows:

i = 0: a possible fusion point is detected at (0, s0). A green mark is set and is propagated to the states (1, s0) and (1, s3). Its coordinate (0, s0) is held in the pointer 0 register.

i = 1: a possible fusion point is detected at (1, s0); a blue mark is set and is propagated to the states (2, s0) and (2, s3). Another possible fusion point is detected at (1, s2); a fuchsia mark is set and is propagated to the states (2, s1) and (2, s3). Since the green mark propagates to all the states at i = 2, its origin becomes a fusion point: the fusion point register is set with the data of pointer 0, which holds the coordinate of the green mark, and pointer 0 is freed. A green straight line across all the states at time i = 2 indicates the time at which the FP is detected; note that even though the actual time instant is i = 1, the detection line of the FP is at i = 2. Before moving to the next time instant, the coordinates of the blue and fuchsia marks are stored in the pointer 0 and pointer 1 registers, respectively (pointer 0 was free because the fusion point had just been detected).

i = 2: a possible fusion point is detected at (2, s0); a red mark is set and is propagated to the states (3, s1) and (3, s3). The fuchsia mark is propagated to the state (3, s2) only and its pointer is freed; the reason why will be explained later. The blue mark propagates to the states (3, s0), (3, s1) and (3, s3). Before moving to the next time instant, the coordinate of the red mark is stored in pointer 1, since it is the only free pointer available.

i = 3: a possible fusion point is detected at (3, s0); a yellow mark is set and is propagated to the states (4, s0) and (4, s2). The red mark is propagated to the state (4, s3) only and its pointer is freed, for the same reason as the fuchsia mark pointer in the previous instant. The blue mark is propagated to (4, s0), (4, s2) and (4, s3). Before moving to the next time instant, the coordinate of the yellow mark is stored in pointer 1, since it is the only free pointer available.

i = 4: a possible fusion point is detected at (4, s0); a turquoise mark is set and is propagated to the states (5, s0) and (5, s2). Another possible fusion point is detected at (4, s2); a brown mark is set and is propagated to the states (5, s1) and (5, s3). Both the blue mark and the yellow mark propagate to all the states at i = 5, which means that the origin of the blue mark and the origin of the yellow mark are both fusion points. However, the point we are looking for is the FP closest to the time being processed, which in this case corresponds to the origin of the yellow mark at (3, s0). The reason is the definition of a FP: if the system traces back from time i = 5, it finds that all paths merge at (3, s0), so (3, s0) represents the point where all paths merge in a trace back operation from i = 5. The point (1, s0), corresponding to the origin of the blue mark, belongs to the survival path, but it does not represent a merging point for a trace back operation that starts at time i = 5.

Now we can extend the previous thought in the following way: suppose that two marks propagate to the same states. This means that, in the future, their propagations will always be the same, and they will have the same possibilities of becoming fusion points. However, the FP closest to the time being processed is the true FP, so it is not necessary to propagate and process the behavior of both marks: the relevant mark is the one whose origin is closest to the time being processed. Therefore, we can enunciate the following rule: whenever two marks coincide, the one with the latest origin is kept.

Finally, before getting back to the algorithm, it is time to explain why the red mark pointer and the fuchsia mark pointer were freed in the previous steps. We saw that each of those marks propagated to only one state. Therefore, if the system keeps propagating them, in the best case they will coincide in the future with a new possible fusion point, and whenever two marks coincide the one with the latest origin is kept; for this case the mark to be kept would be the future possible fusion point. Summarizing, this last rule becomes: whenever a mark propagates to only one state, it has no chance of becoming a FP in the future, and its pointer can be freed.

Now that we have set the main ideas and rules, we return to the algorithm at i = 4: the fusion point register is set with the pointer 1 data, pointer 1 and pointer 0 are freed, and the coordinates of the turquoise mark and the brown mark are stored in them.

i = 5: the algorithm is executed, but there are no new possible fusion point detections, only mark propagations.

Figure 4.21: FPU architecture for a code with constraint length K = 3.

Figure 4.21 presents a design of the FPU for a code with constraint length K = 3. It consists of a Mark Detection unit, which uses the decision bits v_{i,k} provided by the ACSU to detect possible FPs according to the trellis butterfly; a Mark Propagation block, which propagates the new marks and the stored marks along the trellis; and a processing unit, which compares all the marks at its input and proceeds as follows: if there are two equal marks, the one with the latest address is kept; if there is a mark with only one bit set, its corresponding register is freed, since it has no chance of becoming a FP in the future; and if there is a mark with all bits set, a FP is indicated with its address and state. Finally, there is a set of registers used to hold marks, addresses and state codes.

It is important to point out some major concerns. The algorithm can be computed by recursion. There are at most n/2 new possible FPs at each time instant, where n is the number of states of the encoder. Simulations have shown that, for an RSC encoder of rate 1/2 with n = 2^(K-1) states, the amount of registers the FPU needs is: n/2 registers of n + 1 bits to hold marks (the remaining bit is used to indicate whether the register is empty); n/2 registers of K - 1 bits to hold state codes; and n/2 registers of A bits to hold addresses, where A is the number of bits used to code the frame size. Since the processing unit compares all marks at the same time to see if there are equal marks, the number of XOR gates increases drastically with the constraint length of the code. However, it has been observed that Turbo Code schemes with encoders of short constraint length have better BER performance than encoders with large constraint length [18].

Comparing our approach with the previous implementations, we arrive at the results of table 4.1 for an RSC code of rate 1/2, K = 3 and a message frame size of 1024. For a code with constraint length K = 3, a frame size 2^A = 2^10 = 1024 bits and a trace back depth of D = 5K, the RE SMU needs (5 · 3) · 4 = 60 one-bit registers, while the FPU needs (4/2) · (4 + 1) + (4/2) · 2 + (4/2) · 10 = 34 one-bit registers. Also, the FPU will always find the correct FP, while the RE SMU might produce wrong results if the paths do not merge within the trace back pipeline. Another difference is that the RE outputs the symbol sequence of the survival path, while the FPU outputs the sequence of FPs, which are spread along the trellis. However, in a turbo code scheme context, the RUU may take advantage of these FPs, as we will show in the next subsection.


Observation          REU                               FPU
One-bit registers    60                                34
Reliability          Depends on the trace back depth   Optimum
Output rate          One state per clock cycle         Random

Table 4.1: Comparison between the REU and the FPU for a code with rate 1/2, K = 3 and a frame size 2^A = 2^10 = 1024
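A behavioural software model of the mark handling described above is sketched below in Python (our own formulation of the algorithm, not the VHDL of the FPU). It assumes that the ACSU decisions are available as pred[i][k], the survivor predecessor, at time i, of state k at time i+1, and it does not model the hardware limit of n/2 simultaneously tracked marks:

def find_fusion_points(pred, n_states):
    # Returns (origin, detection_time) pairs: origins of marks that end up
    # covering every trellis state, together with their detection lines.
    marks = []        # each mark: {"origin": (time, state), "cover": set of states}
    found = []
    for i, step in enumerate(pred):
        # group the successor states of time i+1 by their survivor predecessor
        chosen_by = {}
        for k2 in range(n_states):
            chosen_by.setdefault(step[k2], set()).add(k2)
        # propagate the stored marks one trellis step forward
        for m in marks:
            m["cover"] = {k2 for k2 in range(n_states) if step[k2] in m["cover"]}
        # a state selected by two successors is a possible fusion point: set a new mark
        for k, cover in chosen_by.items():
            if len(cover) >= 2:
                marks.append({"origin": (i, k), "cover": set(cover)})
        # rules: full coverage -> fusion point; single-state marks are dropped;
        # of two coinciding marks, only the one with the latest origin is kept
        kept = {}
        for m in marks:
            if len(m["cover"]) == n_states:
                found.append((m["origin"], i + 1))   # detection line at time i+1
            elif len(m["cover"]) > 1:
                key = frozenset(m["cover"])
                if key not in kept or m["origin"] > kept[key]["origin"]:
                    kept[key] = m
        marks = list(kept.values())
    return found

The hardware stores at most n/2 marks at a time in the pointer registers described above; this model simply keeps them in a Python list.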
Figure 4.22: Reliability updating problem

4.7 Fusion Points based Reliability Updating Unit.

Before getting into the hardware issues, it is important to highlight the main problem we face at the moment of updating the bit reliabilities. Figure 4.22 illustrates one example. While processing data at time instant i = 4, a FP is found at (3, s0); this FP is colored in green. The example shows the survival path and the competing path traced back from the FP until they merge. The blue branches indicate possible future branches of the survival path, while the red paths indicate possible competing paths in the future. The RUU could start to update bit reliabilities as soon as a FP is detected. However, figure 4.22 shows how the reliabilities of bits i = 2 and i = 3 might still depend on Δ_{4,0} or Δ_{4,2}. The early release of those bit reliabilities leads to a non-optimum SOVA algorithm execution.

One solution to the mentioned problem is illustrated in figure 4.23. The idea is to trace back U additional steps, to allow all the competing paths that start after time i to merge. After U steps, the remaining bit reliabilities can be released. However, this solution introduces the U factor, which is a trade-off between BER performance and power consumption. It has no impact on the area since, as we will show later, the bit reliabilities are updated recursively. In any case, the introduction of the U factor leads to a non-optimum SOVA algorithm execution.

Figure 4.23: One possible solution to the problem of bit reliabilities releasing.

The solution we adopted is introduced by the example of figure 4.24. By time i, two FPs have been detected. Since the second FP resides after the detection line of the first one, the updating process takes place starting from the second FP. Once the first FP is reached, the system continues updating and releasing the bit reliabilities. The fact that the second FP needs to reside after the detection line of the first one is due to the concept that any path traced back from after the detection line will merge at the FP of that detection line. Therefore, any future competing path of the survival path will, at most, merge at the first FP and will not affect the bit reliabilities before the first FP.

Figure 4.24: Solution adopted for the bit reliabilities releasing problem.

We can generalize this solution in an algorithm as follows:

1. Wait for the first FP provided by the FPU.
2. Wait for the second FP.
3. If the second FP is detected after the detection line of the first one, then proceed with the updating process.
4. If the second FP is detected before the detection line of the first FP, then wait for one more FP:
   (a) If the third FP resides after the detection line of the second FP, then proceed with the updating process using the information of the second and third FP.
   (b) If the third FP does not reside after the detection line of the second FP, but it does reside after the detection line of the first one, then the updating process proceeds with the information of the first and third FP.

3 , PF I

iFP

iDFP2

4.7 Fusion Points based Reliability Updating Unit.

47

   (c) If the third FP does not reside after the detection lines of any of the other two FPs, then the third FP is discarded and the RUU continues from step 4.
5. When the updating process finishes, the last FP becomes the first FP and the process is repeated from step 2.
6. If the end of the frame is reached by the ACSU, then the RUU is interrupted and it begins to update the bit reliabilities from the end of the frame.

Figure 4.25.a presents the RUU general scheme. There is a state machine which controls the unit and carries out the previous algorithm. The registers at the left of the figure hold FP state codes, FP addresses and FP detection lines, which are used to address the RAM block and to control the updating process. The lastState unit calculates the previous state in the trellis based on the current state and the decision bit of that state; this unit is actually doing the trace back of the survival path, one step per clock cycle. The current state is used to drive the multiplexers that select the message bit associated with the survival path and the difference between the metric of the survival path and that of a competing path. These elements are fed into the recursive updating unit, which calculates the reliability magnitude |Λ_i| of each bit. The term Lep_i is stored in the RAM block in conjunction with the decision bits v_{i,k} and the Δ_{i,k}. This term is equivalent to

Lep_i = y_si + La_i

and it is used to calculate the final extrinsic information Le_i:

Le_i = Λ_i - Lc·y_si - La_i
Le_i = Λ_i - (y_si + La_i)
Le_i = Λ_i - Lep_i

The term Lep_i is calculated when y_si and La_i are already available, at the time of the branch metric computation, because this saves clock cycles at the time of computing Le_i; not doing it at that time would require accessing the data-in RAM buffer and the RAM La/Le-Le/La again, and the access would have to be done through the interleaving/deinterleaving unit, which might be in use. The calculation of Le_i is done in the following way: the recursive unit outputs |Λ_i|, which is the magnitude of Λ_i, and the bit m_i gives the sign of Λ_i. Since a two's complement representation is used, the bit m_i indicates whether to complement |Λ_i| or not. Then we have:

Le_i = |Λ_i| + (0 - Lep_i)         if m_i = 1
Le_i = not(|Λ_i|) + (1 - Lep_i)    if m_i = 0

The operation in parentheses is done first and its result is delayed until |Λ_i| comes out of the recursive unit; this allows the combinational delays to be distributed among the registers. The resulting Le_i is stored in the RAM La/Le or Le/La, depending on the decoder.

Figure 4.25: Fusion Points based Reliability Updating Unit

The recursive updating unit is shown in figure 4.26. This unit updates bit reliabilities by managing all the competing paths at once. In the scheme there is a set of registers that holds the Δ of each state. These are propagated to the corresponding previous states by the pairs of multiplexers and the reverse trellis topology connection, according to the trellis decision vector; the movement of the Δ is actually the trace back of the competing paths, and its similarity to the recursive procedure of the ACSU can be observed. Whenever two competing paths merge, the one with the minimum Δ is kept. At each stage, the decision bits, along with the estimated message bit, are used to drive the multiplexers that select the relevant Δ, and the minimum among these relevant Δ is the resulting bit reliability. In order to clarify how the recursive unit works, we introduce the example of figure 4.27. The set of registers from figure 4.26 will hold the colored Δ from figure 4.27. When the updating process is launched, the registers are set to the maximum value MAX_Δ.

Figure 4.26: Recursive Updating Unit

Figure 4.27: Recursive Updating Process
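The behaviour of the recursive updating unit can be captured by the following Python sketch (a software model under our own naming and data layout, implementing the HR-SOVA rule; it is not the VHDL description). It assumes that, besides the survival path bits, the trace back information is available as the survivor predecessor pred[i][k] of each state, the message bit branch_bit[i][k] carried by that survivor branch, and the predecessor comp_pred[i] of the branch discarded at the survivor state:

INF = float("inf")

def recursive_update(surv_bit, delta, pred, branch_bit, comp_pred, n_states):
    # surv_bit[i]      : estimated message bit of the survivor branch i -> i+1
    # delta[i]         : path metric difference discarded at the survivor state at i+1
    # pred[i][k]       : survivor predecessor (time i) of state k at time i+1
    # branch_bit[i][k] : message bit of the survivor branch pred[i][k] -> k
    # comp_pred[i]     : predecessor of the branch discarded at the survivor state
    T = len(surv_bit)
    comp = [INF] * n_states        # minimum Delta of a competitor at each state
    rel = [INF] * T
    for i in reversed(range(T)):
        # reliability of bit i: the competitor discarded at i+1 always differs in
        # this bit (RSC property); older competitors count only if their bit differs
        cands = [delta[i]]
        cands += [comp[k] for k in range(n_states)
                  if comp[k] < INF and branch_bit[i][k] != surv_bit[i]]
        rel[i] = min(cands)
        # trace every competitor back one step, merging with min when paths coincide,
        # and insert the newly discarded competitor at its own predecessor state
        nxt = [INF] * n_states
        for k in range(n_states):
            nxt[pred[i][k]] = min(nxt[pred[i][k]], comp[k])
        nxt[comp_pred[i]] = min(nxt[comp_pred[i]], delta[i])
        comp = nxt
    return rel

In the hardware, the same recursion is realized with one Δ register per trellis state and the reverse trellis connections, as described above, so no per-bit trace back memory beyond the ACSU RAM is needed.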


The unit begins at time instant i = 10. The orange Δ is fed into the system through the multiplexer of state 1; at the same time, a minimizing process is started with this orange Δ and the remaining Δ of the registers. The orange Δ is then sent to state 3. At time i = 9, the blue Δ is fed into the system through the multiplexer of state 2; the orange Δ from state 3 and the blue Δ from state 2 participate in the minimizing process, and the blue Δ is sent to state 1 while the orange Δ is sent to state 3 again. At time i = 8, the fuchsia Δ is fed into the system through the multiplexer of state 0; now there are three Δ participating in the minimizing process. Finally, the orange, blue and fuchsia Δ are sent to state 2, state 3 and state 0, respectively. The remaining steps proceed in the same way. Note that, at time i = 6, state 2, two competing paths merge; for this example the blue Δ is assumed to be less than the turquoise Δ, and that is why it is kept.

Before moving into the next section, it is important to discuss some throughput issues. In figure 4.24 we can see that the RUU updates over some distance before it can release the final bit reliabilities. Therefore, if we think of the time distance between fusion points as a random variable with mean D, the RUU processes 2D time instants for each FP detected by the FPU. This means that the FIFO input data rate will be higher than the FIFO output data rate, and the FIFO will get full. If the FIFO gets full, the RUU misses some FPs; however, this is not as bad as it seems, since the algorithm that manages the FPs is still valid. Let us denote by D_R the amount of bits that remain to be updated when the ACSU reaches the end of the frame. Then the throughput of the SOVA SISO can be estimated as

TH_SISO = L / (L + D_R) · f  [bps]          (4.1)

where L is the frame size and f is the frequency of the system. It is straightforward that, if we want to increase the throughput of the system, D_R should be reduced. This can be achieved by increasing the working frequency of the RUU, so that it processes more FPs per time unit and, at the end of the frame, fewer bits remain to be updated.
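As a quick numeric check of equation 4.1 (the value of D_R below is purely illustrative):

def siso_throughput(L, D_R, f):
    # Equation 4.1: estimated SISO throughput in bits per second.
    return L / (L + D_R) * f

# Example: 1024-bit frame, 25 MHz system clock and, hypothetically, 200 bits
# still pending when the ACSU reaches the end of the frame.
print(siso_throughput(1024, 200, 25e6) / 1e6)   # about 20.9 Mbps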

4.8 Control Unit

We finally present the design of the control unit, which is basically a finite state machine that delays and synchronizes the modules. Figure 4.28 shows the scheme. There are two counters: one is responsible for the frame address count and the other is responsible for the iteration count. The iteration counter is first loaded with the number of iterations that the user indicates. Figure 4.29 shows the state diagram that the entire system goes through. Once the user drives the go signal high, the system begins to work: it first initializes the units and progressively activates the corresponding modules before settling down in the decoding state. Once the end of the decoding process is reached, the system checks whether there is an iteration left or not.

Figure 4.28: Control Unit General Scheme.

Figure 4.29: Control Unit State Diagram.


4.9 Improvements

The most common implementations of the SOVA decoder only update the bit reliabilities with the HR-SOVA rule that was described in 3.2.2. A BR-SOVA updating rule would be desirable, since it has been proved in [5] that the max-log-map algorithm and the BR-SOVA are equivalent and that the max-log-map algorithm performs better in terms of BER than the HR-SOVA. However, the BR-SOVA updating rule requires the knowledge of the bit reliabilities of the competing paths, which implies a higher complexity in the decoder; this is the reason why we do not implement a strict BR-SOVA, but instead approximate its behavior by introducing a bound for the bit reliability of the competing path, as shown below. The BR-SOVA updating rule and the HR-SOVA updating rule are the same when the estimated bit and the competing bit are different. In contrast, the following equations recall the updating rules of each algorithm when the estimated bit and the competing bit are equal:

BR:  Λ_j ← min(Λ_j, Δ_{i,k} + Λ^c_j)
HR:  Λ_j ← Λ_j                                      (4.2)

If we assume Λ^c_j = ∞, the HR rule in equation 4.2 can be rewritten as Λ_j ← min(Λ_j, Δ_{i,k} + Λ^c_j). That is why we can think of the HR-SOVA as a BR-SOVA with an unbounded Λ^c_j. The improvement proposed in this work is to bound Λ^c_j to a known value. When working with an RSC binary code, the two incoming branches at any state of the trellis diagram are associated with different message bits; therefore, the difference between the path metrics is actually a bound for the reliability of those message bits. The resulting updating rule becomes:

Λ_j ← min(Λ_j, Δ_{i,k})            if m_i ≠ c_i
Λ_j ← min(Λ_j, Δ_{i,k} + Λ^c_j)    if m_i = c_i

where Λ_j is the reliability of bit j; Δ_{i,k} is the path metric difference between the competing path and the survival path; m_i is the estimated message bit; c_i is the estimated message bit associated with the competing path; and finally Λ^c_j is the path metric difference at the state, at time j, that belongs to the competing path. Figures 4.30 and 4.31 show the modified RUU and the modified Recursive Updating Unit, respectively, which allow the previous rule to be executed. Note that the main difference is the handling of all the Δ_{i,k}, since they must all be available to the recursive update and they represent the bounds for the competing bit reliabilities (Δ_{i+1,k} acts as the bound for Λ^c_{i+1}).
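The modified selection of the relevant Δ can be sketched as follows (Python, our own naming; a software restatement of the rule above rather than the contents of figure 4.31). For states whose survivor branch bit differs from the estimated bit the HR term is used; for the remaining states the competitor contribution is bounded by the Δ stored at that state:

INF = float("inf")

def reliability_candidates(delta_surv, comp, branch_bit, surv_bit, delta_state,
                           brapprox=True):
    # delta_surv    : Delta discarded at the survivor state (its bit always differs)
    # comp[k]       : minimum Delta of the competitor currently traced back at state k
    # branch_bit[k] : message bit of the survivor branch entering state k at this step
    # surv_bit      : estimated message bit of the survival path at this step
    # delta_state[k]: Delta stored by the ACSU at state k, used as the bound for the
    #                 competing-path bit reliability in the BR-SOVA approximation
    cands = [delta_surv]
    for k, c in enumerate(comp):
        if c == INF:
            continue
        if branch_bit[k] != surv_bit:
            cands.append(c)                     # HR-SOVA term: bits differ
        elif brapprox:
            cands.append(c + delta_state[k])    # bounded BR term: bits agree
    return min(cands)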

Figure 4.30: Reliability Updating Unit with BR-SOVA approximation

Figure 4.31: Recursive Update with BR-SOVA approximation

Chapter 5

Methodology
The whole practical design process was carried out with the aid of powerful software tools. Mainly three tools were employed in this thesis. Matlab 7.1: the mathematics software package Matlab was extensively used in the simulation and verification of the design; it was employed to model the whole communication system (encoder, channel, receiver and decoder), and we also used it for the HIL (Hardware In the Loop) verification of the design, which was carried out by establishing a serial port communication with an interface circuit specifically developed for testing purposes. Xilinx ISE 8.2: the synthesis software package of Xilinx was used in all the tasks related to the implementation, specifically the mapping, translation, placement and routing, along with the back annotation and the static timing analysis; the FPGA programmer iMPACT is also included in this package and was used to download our design into the Xilinx Spartan 3 FPGA. ModelSim 6.1: the VHDL code and the Post-Place and Route models were simulated with this tool.

Figure 5.1 summarizes the work flow. Five stages have taken place, with some feedback between them. On the rightmost part we have the fundamental stages of this process, whereas on the leftmost part the verification tasks associated with each stage are displayed; the blue boxes show the main tool employed in each task. We now give a description of the stages of the process. Information gathering: a considerable number of papers and journal articles were collected; they allowed us to understand the main problem and to focus our attention on some aspects of the subject. Specification: the specification of this work consisted of the design and implementation of a SOVA based Turbo Decoder. High level design: a high level model was programmed using the software tool Matlab 7.1; this model allowed us to try the system in different environments and also to fine tune the design specifications cited in the second stage.


Figure 5.1: Project Work Flow.

VHDL implementation: once we were familiar with all the concepts related to the decoding algorithm, we started to work on the structure of the datapath; it was described in VHDL code, and all the combinational modules were verified by appropriate test benches in ModelSim. After the datapath was totally defined, we began to specify the control needs of our system and the way it would communicate with the exterior, and subsequently we gradually defined the whole system. VHDL synthesis: after a functional VHDL model was achieved, the synthesis was carried out; the targeted device was a Spartan 3 X3S200FT256. The system was first verified with a Post-Place and Route model; later, the FPGA was programmed with the iMPACT tool for in-circuit verification. Figure 5.2 illustrates the approach employed for this purpose, while figure 5.3 shows the procedure followed. The serial port baud rate was set to 115200 bps.
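As an illustration of how the HIL test vectors can be produced (the actual flow used Matlab; this Python sketch, with our own bit-to-symbol mapping and an assumed overall code rate of 1/3, only reproduces the BPSK mapping and the AWGN channel):

import numpy as np

def bpsk_awgn(code_bits, ebno_db, rate):
    # Map coded bits to unit-energy BPSK symbols and add white Gaussian noise
    # scaled for the requested Eb/N0 and overall code rate.
    ebno = 10.0 ** (ebno_db / 10.0)
    sigma = np.sqrt(1.0 / (2.0 * rate * ebno))
    symbols = 1.0 - 2.0 * np.asarray(code_bits, dtype=float)   # 0 -> +1, 1 -> -1
    return symbols + sigma * np.random.randn(len(code_bits))

# Example: noisy samples for 1024 random coded bits at Eb/N0 = 1.5 dB
y = bpsk_awgn(np.random.randint(0, 2, 1024), 1.5, 1.0 / 3.0)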

Figure 5.2: Hardware-in-the-loop approach (Matlab communicates over an RS232 serial port with an interface unit and the decoder inside the Spartan 3 X3S200FT256 FPGA)

Figure 5.3: Hardware-in-the-loop verification procedure (source, channel coding, BPSK modulation, the AWGN discrete channel and the BER calculation run in Matlab; the channel decoding runs on the Spartan 3 X3S200FT256 FPGA)


Chapter 6

Measures and Results


The system presented in chapter 4 was described using VHDL (Very high speed integrated circuits Hardware Description Language). A generic and parameterizable VHDL code was written: a VHDL package includes the frame size, the quantization scheme, the polynomials of the code and the SOVA algorithm mode (HR-SOVA or BR-SOVA approximation), and the system can be configured through this package before the synthesis is performed. The targeted device was a general purpose Xilinx FPGA Spartan 3 X3S200FT256. All the tests have been done for two polynomial pairs: one is the pair we have been using throughout this work, Pfb = [111], Pg = [101]; the other is the UMTS polynomial pair, Pfb = [1011], Pg = [1101]. The size of the data frame has been set to 1024 bits and is the same for all simulations and syntheses. The depth of the RUU FIFO has been set to 16 FPs. We have employed two types of interleavers. One of the interleavers is given in [14] (from now on, MCF) and is described by the following equations:

π⁻¹(x) = 31x + 64x² mod 1024   (deinterleaver)
π(x)  = 991x + 64x² mod 1024   (interleaver)

The other interleaver was randomly generated (from now on, RAND); its function is described by a look-up table. The tests with normalization by fixed scaling factors have not been done yet and are left as future work. We present the results in the following subsections according to their nature.
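The two ROM look-up tables can be generated and cross-checked with a few lines of Python (a sketch of the table generation only; the implementation stores the results as ROM contents):

L = 1024

def qpp(a, b, x):
    # Quadratic permutation a*x + b*x^2 mod L used by the MCF interleaver.
    return (a * x + b * x * x) % L

interleaver = [qpp(991, 64, x) for x in range(L)]
deinterleaver = [qpp(31, 64, x) for x in range(L)]

# The two tables are inverse permutations of each other.
assert all(deinterleaver[interleaver[x]] == x for x in range(L))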

6.1 Quantization Scheme

The quantization scheme is presented in table 6.1; the same quantization scheme is used in all the tests. This scheme has been adopted from [18].

Element                   Word width : Fractional part
Received symbols y_i      4:2
Extrinsic information     7:2
Path metrics              10:2
Δ_{i,k}                   4:2

Table 6.1: Quantization Scheme Summary


Figure 6.1: Quantization effect on the system BER performance (Δ quantized to 4:2, 6:2 and 8:2). BR-SOVA approximation scheme. Simulation with quantization. MCF interleaver. Pfb = [111], Pg = [101]

The only quantization study that has been carried out concerns the path metric difference Δ, which has a significant impact on the system BER performance. Figure 6.1 shows the BER curves against the received signal SNR. It is observed that, for the current example, the 4:2 scheme is better than the 6:2 and 8:2 schemes. This behavior has been reported in [11] as a method of improving the system BER performance: since the quantization saturates the Δ, the overoptimistic values of the bit reliabilities are lessened and consequently the system BER performance improves. Note that adopting the reduced quantization scheme yields further benefits: first, the RAM that stores the data the ACSU provides is reduced, and furthermore the logic related to the RUU is also reduced.
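A minimal model of the saturating 4:2 quantizer applied to Δ is given below (whether the word is treated as signed is an assumption left as a parameter, since Δ is non-negative by construction):

def quantize(x, word_bits=4, frac_bits=2, signed=False):
    # Saturating fixed-point quantizer: word_bits total bits, frac_bits of them fractional.
    step = 2.0 ** (-frac_bits)
    if signed:
        lo, hi = -2 ** (word_bits - 1), 2 ** (word_bits - 1) - 1
    else:
        lo, hi = 0, 2 ** word_bits - 1
    code = int(round(x / step))
    return max(lo, min(hi, code)) * step

print(quantize(5.7))   # 3.75: large Deltas saturate, tempering overoptimistic reliabilities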

6.2 Synthesis Results

Tables 6.2 and 6.3 present the synthesis results for the short pair of polynomials and for the UMTS polynomials, respectively. Both pairs of polynomials were synthesized with the quantization scheme given in table 6.1.


Observation                                        HR           BRap         Resources
Logic utilization:
  Number of Slice Flip Flops                       720 (18%)    752 (19%)    3840
  Number of 4-input LUTs                           776 (20%)    803 (20%)    3840
Logic distribution:
  Number of occupied Slices                        677 (35%)    674 (35%)    1920
  Number of Slices containing only related logic   677 (100%)   674 (100%)
  Number of Slices containing unrelated logic      0            0
  Total number of 4-input LUTs                     789 (20%)    816 (21%)    3840
    Number used as logic                           776          803
    Number used as a route-thru                    13           13
Number of Block RAMs                               10 (83%)     10 (83%)     12
Number of MULT18X18s                               1 (8%)       1 (8%)       12
Number of GCLKs                                    4 (50%)      4 (50%)      8
Total equivalent gate count for design             671207       671658

Table 6.2: Synthesis results for Pfb = [111], Pg = [101]

Observation                                        HR           BRap         Resources
Logic utilization:
  Number of Slice Flip Flops                       1045 (27%)   1108 (28%)   3840
  Number of 4-input LUTs                           2067 (53%)   2096 (54%)   3840
Logic distribution:
  Number of occupied Slices                        1256 (65%)   1329 (69%)   1920
  Number of Slices containing only related logic   1256 (100%)  1329 (100%)
  Number of Slices containing unrelated logic      0            0
  Total number of 4-input LUTs                     2082 (54%)   2111 (54%)   3840
    Number used as logic                           2067         2096
    Number used as a route-thru                    15           15
Number of Block RAMs                               11 (91%)     11 (91%)     12
Number of MULT18X18s                               1 (8%)       1 (8%)       12
Number of GCLKs                                    4 (50%)      4 (50%)      8
Total equivalent gate count for design             748769       749432

Table 6.3: Synthesis results for Pfb = [1011], Pg = [1101]

Note that the BR-SOVA approximation spends almost the same amount of resources as the HR implementation. In contrast, the amount of used resources increases significantly when working with the UMTS polynomials; this is due to the fact that the UMTS encoder has twice the number of states. Table 6.4 shows the maximum frequencies that the system can attain. When working with the pair of short polynomials, the system can reach up to 85 MHz; the critical path is located in the ACSU and is related to the add, compare, select and quantization delays. On the other hand, when working with the UMTS pair of polynomials, the maximum clock frequency suffers a considerable degradation, due to the excessive combinational logic that the FPU requires for an eight-state code. The optimization of this unit should be considered as future work.


Figure 6.2: HR-BRapprox comparison. Infinite precision simulations. MCF interleaver. Pfb = [111], Pg = [101]

Polynomials                            Maximum clock frequency   Critical path
Short, Pfb = [111], Pg = [101]         85 MHz                    ACSU
UMTS, Pfb = [1011], Pg = [1101]        29 MHz                    FPU

Table 6.4: Maximum clock frequencies

6.3 Bit Error Rate Results

Before getting into the HIL results, we will discuss the BER performance of the BR-SOVA approximation, which is shown in figure 6.2. These results were obtained by simulation with a floating point numeric representation. We observe that, for an error probability of 10^-4, the BR-SOVA approximation gains 0.3 dB over the HR-SOVA at the eighth iteration, while for an error probability of 10^-5 it gains only 0.23 dB over the HR-SOVA at the eighth iteration. We also observe that, for higher SNRs, the curves begin to converge and the distance between them gets shorter.

Figure 6.3 exhibits the real system BER performance when implementing the HR-SOVA for the short pair of polynomials; the figure illustrates the comparison between the hardware implemented HR-SOVA and the floating point simulations. Note that the real HR-SOVA performs better; this is due to the quantization effect that was explained in 6.1.


Figure 6.3: HR-SOVA HIL results. MCF interleaver. Pfb = [111], Pg = [101]

Figure 6.4 shows the real system BER performance when implementing the BR-SOVA approximation for the short pair of polynomials. For low SNRs, we see that the real decoder performs worse than the floating point simulation, while for high SNRs the opposite situation is observed. Note that, for the BR-SOVA approximation, the BER performance of the real decoder is about the same as the BER performance of the floating point simulation: the quantization does not improve the BER as much as in the HR implementation. The comparison between the HR-SOVA implementation and the BR-SOVA approximation implementation is shown in figure 6.5. The figure also shows a partial plot of a quantized max-log-map algorithm with the following quantization scheme:

Element                   Word width : Fractional part
Received symbols y_i      4:2
Extrinsic information     7:2
                          7:2
                          9:2
                          9:2

Table 6.5: Quantization Scheme Summary

We observe that, in the worst case, the HR-SOVA is 0.14 dB from the BR-SOVA approximation, and the latter is only 0.1 dB from the quantized implementation of the max-log-map. Finally, figures 6.6 and 6.7 show some partial results of the BER performance with the UMTS polynomials and the randomly generated interleaver.


Figure 6.4: BR-SOVA approximation HIL results. MCF interleaver. Pfb = [111], Pg = [101]

Figure 6.5: HR-BRapprox HIL comparison. MCF interleaver. Pfb = [111], Pg = [101]


Figure 6.6: HR-BRapprox comparison. Infinite precision simulations. RAND interleaver. Pfb = [1011], Pg = [1101]

Figure 6.7: BR-SOVA approximation HIL results. RAND interleaver. Pfb = [1011], Pg = [1101]


6.4 Throughput Results

In this section we investigate the eect of running the RUU at higher frequencies and its impact in the system throughput. A DCM (Digital Clock Manager) was used in order to generate the corresponding frequencies. Figures 6.8, 6.9 and 6.10 show the throughput histogram statistics for the frequencies relations fRU U = f ,fRU U = 2f and fRU U = 3f respectively and for the short pair of polynomials. The statistics were generated with 50000 samples. We observe that the throughput increases whit the RUU working frequency as we expected. In a real application context, the system has to guarantee a constant throughput so it could be set to one of the minimum intervals observed in the histogram. These values are summarized in table 6.6. We can think of a power saving benet since the system, according to the gures, will work faster than the guaranteed throughput. Therefore, when the system nishes the execution it goes to an idle state until a new set of data arrives, during this idle state no activity is performed in the circuit which supposes an important reduction in the power consumption. Figures 6.11, 6.12 and 6.13 show the same throughput histogram statistics but this time for the UMTS pair of polynomials. We observe the same eect than with the short pair. However we notice a slightly dierence in the statistics between them. This is due to the appearing frequency of FPs, which is higher for higher constraint lengths. Observation fRU U = f fRU U = 2f fRU U = 3f Short Polynomials 0.5259f [bps] 0.8258f [bps] 0.9543f [bps] UMTS Polynomials 0.5270f [bps] 0.8308f [bps] 0.9399f [bps]

Table 6.6: Minimum estimated throughput.
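For illustration, the sketch below shows how such a guaranteed figure can be extracted from per-block throughput measurements. The samples here are synthetic stand-ins drawn uniformly for the example, not the HIL data behind Figures 6.8 to 6.13:

```python
# Minimal sketch (synthetic data): the guaranteed throughput is taken as the
# worst per-block value observed over many decoded blocks.
import random

f = 25e6  # system clock in Hz (example value)

# Hypothetical per-block SISO throughput samples, as a fraction of f;
# the real histograms were built from 50000 HIL observations.
samples = [0.52 + 0.09 * random.random() for _ in range(50000)]

guaranteed = min(samples)
print(f"guaranteed throughput = {guaranteed:.4f}*f = {guaranteed * f / 1e6:.2f} Mbps")
```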


Figure 6.8: Throughput statistics. f = 25 MHz, fRUU = 25 MHz. Pfb = [111], Pg = [101]. (Histogram of SISO throughput as a percentage of the system clock.)

Figure 6.9: Throughput statistics. f = 25 MHz, fRUU = 50 MHz. Pfb = [111], Pg = [101]


Figure 6.10: Throughput statistics. f = 16.66 MHz, fRUU = 25 MHz. Pfb = [111], Pg = [101]

Figure 6.11: Throughput statistics. f = 25 MHz, fRUU = 25 MHz. Pfb = [1011], Pg = [1101]


Figure 6.12: Throughput statistics. f = 25 MHz, fRUU = 50 MHz. Pfb = [1011], Pg = [1101]

Figure 6.13: Throughput statistics. f = 16.66 MHz, fRUU = 50 MHz. Pfb = [1011], Pg = [1101]


6.5 Power Results

The power consumption has been estimated by simulations. Table 6.7 summarizes the results. The system frequencies were set to f = 25 MHz and fRUU = 50 MHz. The simulation test bench was carefully designed to guarantee a SISO throughput of 0.8f = 20 Mbps, which is feasible according to figures 6.9 and 6.12. We observe a dynamic power consumption of (22 - 12) = 10 mW for the short pair of polynomials. The dynamic power consumption rises to (29 - 12) = 17 mW when working with the UMTS polynomials. This effect was expected, since the area increase is about 50% when going from four states to eight. Table 6.7 only shows the power consumption of the BR-SOVA approximation, since the difference between the BR-SOVA approximation scheme and the HR-SOVA scheme is negligible.

Observation                              Short Polynomials   UMTS Polynomials
Total estimated power consumption [mW]   47                  54
Vccint 1.20V                             22                  29
Vccaux 2.50V                             25                  25
Vcco25 2.50V                             0                   0
Clocks                                   6                   6
Inputs                                   1                   1
Outputs                                  2                   4
Vcco25                                   0                   0
Signals                                  2                   5
Quiescent Vccint 1.20V                   12                  12
Quiescent Vccaux 2.50V                   25                  25

Table 6.7: Estimated power consumption. BRapprox. f = 25 MHz, fRUU = 50 MHz
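The dynamic figures quoted above follow directly from the table entries; as a minimal worked example (values copied from Table 6.7):

```python
# Dynamic core power = total Vccint power minus its quiescent part.
def dynamic_core_power_mw(total_vccint_mw, quiescent_vccint_mw):
    return total_vccint_mw - quiescent_vccint_mw

print(dynamic_core_power_mw(22, 12))  # short polynomials -> 10 mW
print(dynamic_core_power_mw(29, 12))  # UMTS polynomials  -> 17 mW
```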

Chapter 7

Conclusions and future work


We have designed a complete Turbo Decoder based on the SOVA algorithm. For this purpose we have introduced a new algorithm for performing the SOVA decoding and we have designed the architecture that implements it. The resulting design is not affected by the D-U trade-off and it achieves an optimum SOVA execution. We have also introduced a modification of this architecture that approximates the BR-SOVA. The resulting BER of this last scheme is 0.1 dB from a comparable Max-Log-Map algorithm.

As future work, the following key points are proposed:

- The system throughput is affected by the management of the fusion points. Different schemes should be studied with the aim of improving the resulting throughput. For example, a LIFO memory could be employed instead of a FIFO at the input of the RUU (a toy illustration of the ordering difference is sketched after this list).

- The power consumption of the system could be reduced by properly selecting the FPs that launch the reliability updating process. This way, a long updating-without-releasing process can be avoided.

- The critical path of the system, for the UMTS polynomials, resides inside the FPU. Optimization strategies should be analyzed in order to reduce the combinational delays.

- A non-optimum SOVA execution could be adopted by properly reducing and managing the RAM buffer that stores the data provided by the ACSU units. Judging from other implementations, this memory could be reduced by more than 50% without major BER performance degradation.

- A complete BR-SOVA should be carefully studied for implementation. This could probably be achieved by replicating the recursive updating unit: one of these units traces back and updates the survival path, while the others do the same with the competing paths.
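As a toy illustration of the FIFO-versus-LIFO point above (hypothetical fusion-point indices, not the RUU buffer implementation), the only behavioural difference is the order in which pending FPs are served to the updating process:

```python
# Toy sketch: FIFO vs. LIFO ordering of pending fusion points.
from collections import deque

pending = [12, 47, 63]          # hypothetical FP positions, in arrival order

fifo = deque(pending)
fifo_order = [fifo.popleft() for _ in range(len(pending))]   # oldest first

lifo = list(pending)
lifo_order = [lifo.pop() for _ in range(len(pending))]       # newest first

print(fifo_order)   # [12, 47, 63]
print(lifo_order)   # [63, 47, 12]
```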


