
GPUs for OFDM based SDR Prototyping: A

Comparative Research Study


James Chacko, Danh Nguyen, Cem Sahin, Nagarajan Kandasamy, and Kapil Dandekar
Drexel Wireless Systems Lab, Electrical and Computer Engineering
Drexel University, Philadelphia, PA 19104
Email:{jjc652, dhn2, cs86, kandasamy, dandekar}@drexel.edu
Abstract: This paper presents a comprehensive research study of the application of Graphics Processing Units (GPUs) in software-defined radio (SDR) prototyping. We introduce the SDR paradigm in wireless communications, which implements all or part of the physical-layer processing kernels in software. This provides easy adaptation to multiple wireless standards, fast time to market, higher chip volumes, and support for later implementation changes. We identify the architectural elements necessary for efficient implementations of SDR on the GPU. Based on these observations, we implemented several communication baseband kernels in the MATLAB-CUDA framework. Our experimental results indicate that CUDA automatic vectorization is only advantageous for the IFFT kernel at sufficient data sizes; all other baseband kernels need further optimization in order to run effectively on the GPU.
Keywords: Graphical Processing Unit, Compute Unified Device Architecture, Software Defined Radio, Orthogonal Frequency Division Multiplexing

I. INTRODUCTION

Wireless protocols are often implemented in custom hardware in order to satisfy their heavy computational requirements within the low power margins available. Hardware implementations take longer to design and verify and therefore require longer development times. A programmable software implementation of the physical layer, also called Software Defined Radio (SDR), is therefore very advantageous in terms of supporting multiple protocols, faster time-to-market, higher chip volumes, and easy modification. For SDRs to balance the power and processing involved, it is often necessary to choose the right underlying architecture for their implementation. In this study, we discuss a few architectural methods used for vectorization of code, such as Single Instruction Multiple Packed Data (SIMpD), Single Instruction Multiple Disjoint Data (SIMdD), and Very Long Instruction Word (VLIW), for accomplishing parallel data computation. Once we analyze and establish vectorization as an intuitive way to implement SDR standards, we further explain why vectorization is advantageous for SDRs in terms of implementing different standards on the same device.
This paper presents our initial research findings, with results from our implementation of a generic physical layer for an Orthogonal Frequency Division Multiplexing (OFDM) based wireless standard. The results specifically examine the speedup achieved in the modules that constitute the SDR baseband (such as coding/decoding, modulation/demodulation, interleaving/deinterleaving, and IFFT/FFT) implemented on graphics processors (GPUs) versus an implementation of these modules in MATLAB on a Core i7 machine. We do this by interfacing with Nvidia's GPU computing platform, the Compute Unified Device Architecture (CUDA), from within the MATLAB framework for direct comparison of the different areas. We conclude our research study by characterizing the architecture required for building a GPU-accelerated platform for quick and easy prototyping of SDR applications.
II. SOFTWARE DEFINED RADIO (SDR)

A. SDR: A generic overview


Software Defined Radios (SDRs) largely consist of two areas: a hardware-focused area and a software-focused area. The hardware side mainly consists of antennas, A/Ds and D/As, while the software side consists of three groups: the filter stage, the modem stage and the codec stage. The software side is highly volatile and therefore differs significantly across the different radio standards implemented. The filter stage focuses on enforcing band limitation and is placed right before and after the D/A and A/D respectively. Filters in an SDR have high computational loads (2-5 billion multiplications and additions per second for the Universal Mobile Telecommunications System (UMTS) standard) and are often implemented by a configurable filter instead of a generic DSP implementation, due to the latter's large power requirements [1]. The modem stage is also called the inner transceiver. It is the most diverse amongst different standards and key in signal conditioning, which involves rake reception, correlation, synchronization, detection, equalization, FFT, OFDM, mapping, de-mapping, matrix multiplications, inversions, etc. This is the most intensively researched stage of SDR, since it has room for algorithm evolution to improve throughput and performance, leading to better BER under power constraints. The codec stage is called the outer transceiver and takes care of data manipulation outside the immediate frames/symbols of data; it involves encoding, decoding, interleaving, deinterleaving and a variety of established channel algorithms like Turbo, Reed-Solomon and Viterbi. This stage is computationally heavy, and the state-machine-like implementation of these kernels makes it a good area to provide hardware support to take it off the critical path while implementing SDR.
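To make the modem and codec stages above concrete, the following plain-Python sketch shows two representative baseband kernels: a 4-QAM (QPSK) symbol mapper and a simple block interleaver. This is our illustrative code, not the paper's MATLAB implementation; the function names, the particular Gray-coded constellation, and the row/column interleaver shape are all assumptions made for the example.

```python
def qam4_map(bits):
    """Map pairs of bits to Gray-coded 4-QAM (QPSK) constellation points."""
    # Assumed mapping: 00 -> 1+1j, 01 -> -1+1j, 11 -> -1-1j, 10 -> 1-1j
    table = {(0, 0): 1 + 1j, (0, 1): -1 + 1j,
             (1, 1): -1 - 1j, (1, 0): 1 - 1j}
    return [table[(bits[i], bits[i + 1])] for i in range(0, len(bits), 2)]

def block_interleave(symbols, rows, cols):
    """Block interleaver: write row-wise into a rows x cols grid, read column-wise."""
    assert len(symbols) == rows * cols
    return [symbols[r * cols + c] for c in range(cols) for r in range(rows)]

bits = [0, 0, 0, 1, 1, 1, 1, 0]
syms = qam4_map(bits)                  # four complex symbols
shuffled = block_interleave(syms, 2, 2)  # spread adjacent symbols apart
```

Both kernels apply the same operation independently to many data elements, which is exactly the shape of work the data-parallel architectures discussed later exploit.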
B. Demand for SDR
The number of communication standards available nowadays is large, so mobile devices that can work across standards are very attractive and extensively researched [1]-[6]. The major challenge of an SDR implemented to cross standards is then the ability to realize multiple giga-instructions per second (GIPS) of flexible baseband processing under power limitations and time constraints. In order to implement a crossover between two standards, the baseband must be flexible enough to have one active standard while continuing to sniff for other available standards to connect to. The challenge here is to keep the computational load within the bounds of the embedded processor while crossing over without a break in service. High performance demands under power and throughput restrictions have always been the concern for devices running digital signal processing (DSP) algorithms. Time after time, these DSP algorithms have been changed, modified and customized to suit their underlying architecture for better performance. With the increasing availability of multiple simultaneous processing units, the major front on which to optimize these DSP techniques now lies in parallelism. Parallelism gives these DSP kernels the ability to perform multiple computations simultaneously, which makes research into its adaptation to, and limitations in, SDR significant.
III. PARALLEL ARCHITECTURES TO IMPLEMENT SDRs

A. Instruction vs. Data Level Parallelism


The two different types of parallelism are Instruction Level Parallelism (ILP) and Data Level Parallelism (DLP). Instruction-level parallelism involves setting up instructions to run concurrently at the operation level, once the data dependencies between these concurrently scheduled operations are either taken care of or removed with architectural techniques. Data-level parallelism, on the other hand, runs the same set of operations on multiple independent data sets. Computationally heavy numerical applications typically have room for data-level parallelism, while control instructions do not. A complete application involving such numerically intensive kernels on frames/symbols of data, as seen in SDR, has potential for both instruction- and data-level parallelism. Here instruction-level parallelism takes care of the similar computations required amongst frames/symbols, while data-level parallelism parallelizes computations within a frame/symbol. Data-level parallelism on data elements narrower than the normal 32 bits is called sub-word data-level parallelism. Implementing parallelism is not always straightforward: while superscalar machines can detect instruction-level parallelism in hardware, all other approaches require the instruction-level, data-level and sub-word parallelism to be explicitly exposed by either the compiler or the programmer.
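The sub-word data-level parallelism mentioned above can be sketched in plain Python using the classic SWAR (SIMD within a register) trick: four 8-bit lanes packed into one 32-bit word are added with a single integer addition, with masking to keep carries from spilling between lanes. This is our illustrative example, not an operation from any architecture named in the paper.

```python
def pack4(a, b, c, d):
    """Pack four 8-bit values into one 32-bit word (a in the low lane)."""
    return a | (b << 8) | (c << 16) | (d << 24)

def swar_add(x, y):
    """Lane-wise 8-bit addition of two packed words, carries suppressed."""
    mask = 0x7F7F7F7F
    # Add the low 7 bits of each lane, then fold each lane's top bit back in
    # with XOR so a carry out of one lane never reaches its neighbour.
    return ((x & mask) + (y & mask)) ^ ((x ^ y) & ~mask & 0xFFFFFFFF)

x = pack4(1, 2, 3, 4)
y = pack4(10, 20, 30, 40)
z = swar_add(x, y)   # lanes now hold 11, 22, 33, 44
```

One scalar addition here performs four independent 8-bit additions, which is precisely the saving a sub-word SIMD unit provides in hardware without the masking overhead.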
B. SISD vs. VLIW vs. SIMpD vs. SIMdD vs. indirect-SIMdD
The three main parallel architectures we discuss in this research are Very Long Instruction Word (VLIW), Single Instruction Multiple Packed Data (SIMpD) and Single Instruction Multiple Disjoint Data (SIMdD) [2]. Implementations of such architectures are complex, as they involve overcoming bottlenecks that call for improved address generation, loop handling, data reordering, matrix-oriented computation, data alignment, vector permutation, and design extensions for performing vector calculations on superscalar architectures versus in-order vector processors with long words. One of the major differences among the above-mentioned architectures lies in the hardware involved in performing the required data accesses. The Single Instruction Single Data (SISD) architecture, the simplest kind, processes a sequential instruction stream and produces an output stream. The requirement of just one register containing the data to work on per instruction makes it simple to implement and avoids any extra design to handle data dependencies between the different instructions being run. The Very Long Instruction Word (VLIW) architecture uses register streams to handle data access for simultaneous, different processing of multiple independent data. The instruction stream in VLIW is a combination of multiple operations to be computed on different data, arranged so that the output dependencies of the computations do not overlap and, even if they did, would not cause computational errors due to out-of-sequence execution. When a VLIW as described above is designed so that a single sequential instruction runs on multiple data elements, it leads to the Single Instruction Multiple Data (SIMD) architecture. Based on the application, the data being worked on come from different data registers and can be written back to the same or to disjoint locations. The latter case leads to the Single Instruction Multiple Disjoint Data (SIMdD) architecture. SIMdD is, however, harder to implement and not currently used, due to the complexity of its additional hardware and the control of multiple registers. Another technique packs multiple data elements for SIMD into one single register, often called the SIMpD architecture. In an architecture similar to SIMdD, disjoint data are instead accessed through vector pointers; this is known as the indirect-SIMdD architecture. Instead of explicitly specifying vector elements, in indirect-SIMdD, pointers to the source and destination elements are provided and vector fields specify multiple indices.
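The difference between packed and disjoint data access can be sketched as follows. This is an illustrative model of the access patterns only, in Python rather than hardware; the vector length, function names, and alignment rule are assumptions for the example.

```python
VLEN = 4  # assumed vector register length

def simpd_load(memory, base):
    """SIMpD-style access: one contiguous, aligned block fills the register."""
    assert base % VLEN == 0, "packed loads carry an alignment requirement"
    return memory[base:base + VLEN]

def simdd_gather(memory, index_vector):
    """indirect-SIMdD-style access: disjoint elements selected by an index
    (pointer) vector, with no alignment or contiguity requirement."""
    return [memory[i] for i in index_vector]

mem = list(range(100, 116))
packed = simpd_load(mem, 4)                  # elements 4..7, contiguous
gathered = simdd_gather(mem, [0, 5, 2, 15])  # scattered elements, any order
```

The gather path buys flexibility (no padding, arbitrary element selection) at the cost of the extra address hardware the text describes.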
IV. SDR BASEBAND ARCHITECTURE

The main components of a generic baseband architecture that uses vector processors are a microcontroller, memory, configurable filters, a DSP chip and configurable channel decoders [1]. The on-board microcontroller chip is mainly for Link/MAC-layer processing and takes care of control and communication among the various baseband components and RF tasks. The on-board DSP provides legacy code support, such as speech codecs and scalar algorithms that, if run on a vector processor, would waste resources due to their known sequential dependencies. One or two channel decoders, such as Viterbi or Turbo, are present and can be weakly configured to the standard being focused on. Channel decoders require hardware support to keep them off the critical path of the system; this can be achieved even by having more than one channel decoder on board to switch between, in cases where the differences in efficiency are not nominal. Configurable channel filters are present to band-limit the standards. The main component is the vector processor, which is used for number crunching and adds multi-standard flexibility. In a generic vector processor suitable for SDR, even though the core is SIMD, a VLIW execution model is used in order to support parallelism among multiple vector processing units. Having a VLIW execution model also makes it flexible enough to run multiple scalar functional units simultaneously. We consider two architectures based on this model to see the benefits of implementing SDR with it: the OnDSP architecture and the EVP architecture [1], [7]. The OnDSP vector processor is the key component in several multi-standard programmable Wireless LAN baseband ICs [1], [6]. In this architecture, a single VLIW instruction can specify a number of vector operations such as load/store, ALU, MAC, address calculation and loop control. OnDSP also has specific vector operations allowing it to perform word insertion/deletion, sliding and Gray coding/decoding. The Embedded Vector Processor (EVP) is a versatile processor originally developed to support 3G standards [1]. This architecture is large enough to cover the functionality of OnDSP as well as OFDM standards. Its SIMD element is scalable, and its enormous width can help it generically sustain five vector operations, four scalar operations, three address updates and loop control. The EVP has additional components with respect to the OnDSP, considering the large load it handles. The additional functional units consist of a shuffle unit, a code generation unit and an intra-vector unit. The shuffle unit rearranges the elements of a single vector into an arbitrary pattern. The code generation unit supports CDMA code generation. The intra-vector unit performs operations such as maximum, add, etc. over the elements of a single vector when required. Related research in this area has clearly shown, based on the computational loads of different standards, that utilizing an EVP leaves headroom for more computation and is therefore very advantageous for SDR implemented to cross standards [1]. This available computation leaves room for introducing more demanding and improved algorithms, and enables multiple standards to run simultaneously to ease handoffs.
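Of the OnDSP vector operations listed above, Gray coding/decoding has a particularly compact form, shown here as a plain-Python sketch (ours, not OnDSP's instruction semantics): each value is encoded so that adjacent integers differ in exactly one bit.

```python
def gray_encode(n):
    """Binary to reflected Gray code: adjacent values differ in one bit."""
    return n ^ (n >> 1)

def gray_decode(g):
    """Invert the encoding by XOR-folding successively shifted copies."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n
```

The one-bit-difference property is what makes Gray-coded constellation mappings attractive: a nearest-neighbour symbol error corrupts only a single bit.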
V. ARCHITECTURE DIFFERENCES AND TRADEOFFS

It is necessary to know the different implementation aspects of these parallel methods in order to understand the applications where they can be advantageous. In terms of VLIW, the implementation is straightforward: multiple operations on multiple data. For instance, a VLIW which contains four opcodes for four operations will separately have three registers per operation, which map to two sources and one destination. VLIWs are the most commonly used architecture due to their lower complexity. SIMpD, on the other hand, needs a special vector register to access data and has alignment requirements. SIMdD does not have alignment requirements, as data access is based on vector pointers, and therefore the number of actual data elements being worked on can be much larger than in both SIMpD and VLIW implementations. The core architectural tradeoffs seen with the above parallelism methods are based on the number/size of data memory, instruction memory, data memory ports, register files, and functional units, and on compatibility across the wide range of application specifics. Data memory alignment matters to three of the four architectures described, and therefore there may or may not be wastage, depending on whether the application's vectors require padding. Instruction size varies depending on the architecture used: in order to fully utilize these parallel architectures, techniques such as loop unrolling, rotating registers, variable aliasing and renaming have to be considered, which increase code size. These architectures have specific data port requirements in order to perform without starving for data or being underutilized. The process of organizing data to the vector length required for data-level parallelism is known as vectorization, and it can now be done automatically by compilers, a process referred to as auto-vectorization. But a compiler's efficiency in detecting and implementing auto-vectorization for DLP largely depends on the type of code. Compilers often find it hard to vectorize for SIMD, as they must account for vector alignment, so manual tweaking is required. The more regular structure of VLIWs makes them more compiler friendly, so they benefit from auto-vectorization.
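The vectorization step and the padding "wastage" described above can be sketched as follows: a scalar loop is rewritten to process data in fixed vector-length chunks, with zero padding added so the tail of the data fills a whole vector. This is our illustrative model of the transformation, not any compiler's output; the vector length and function name are assumptions.

```python
VLEN = 4  # assumed vector register length

def vectorized_scale(data, factor):
    """Scale every element, processed in VLEN-wide chunks with a padded tail."""
    pad = (-len(data)) % VLEN           # elements wasted on padding
    padded = data + [0] * pad           # satisfy the vector-length requirement
    out = []
    for i in range(0, len(padded), VLEN):
        chunk = padded[i:i + VLEN]      # one "vector register" of work
        out.extend(v * factor for v in chunk)  # same operation on every lane
    return out[:len(data)]              # drop the padded lanes on the way out
```

For a 5-element input and VLEN of 4, three of the eight processed lanes are padding: exactly the alignment-driven wastage the tradeoff discussion refers to.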
VI. GPU IMPLEMENTATION AND RESULTS

This paper shows the results of our initial attempts to reproduce the benefits of parallelization by running an OFDM communication model on a Graphical Processing Unit (GPU). We developed an OFDM communication baseband in MATLAB and used a runtime platform called Jacket, by AccelerEyes [8], to run the code on a Tesla C1060 GPU with 240 processing cores. Unlike the Embedded Vector Processor (EVP) [1] and the Signal Processing On-Demand Architecture (SODA) [3], where the code was optimized to run SDR, Jacket does not provide tools to micro-manage the parallelization process; instead it parallelizes by set rules once the MATLAB code has been manually vectorized. We wrapped Jacket around all the major kernel processing elements of the OFDM communication model and compared the observed runtime results against running the model on a Core i7 processor.

We ran our communication baseband layer under two different modulation schemes (4QAM and 16QAM) and four different subcarrier sizes (64, 128, 256 and 512) to understand how the GPU would be utilized. The resulting comparison figures can be seen below. On comparison, the main module that showed a clear runtime advantage on the GPU was the IFFT/FFT kernel. As expected, GPU computation of the IFFT/FFT did not show any significant runtime advantage until the number of OFDM symbols transmitted was increased to a significant size for a 512-point FFT doing 4QAM/16QAM modulation. The reason is that any smaller number of OFDM symbols underutilizes the GPU at the current level of code optimization. We should also keep in mind that the GPU is being compared against an Intel Core i7 64-bit processor, which is itself significantly fast. The other processing modules of the OFDM communication kernels did not show much runtime improvement, which might be because the code was not customized to run efficiently on a GPU, since the parallelization was completely automated, and because of sequential dependencies inherent to the signal processing. The fact that a speedup was seen in the IFFT/FFT assures us that parallelization is definitely an area of interest, and there are current SDR platforms where GPU accelerators already appear [9]. The challenge remains to implement the full communication baseband optimally for GPUs, as well as optimizations that can run smaller numbers of OFDM symbols efficiently.
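The IFFT/FFT kernel at the centre of these results can be sketched as a recursive radix-2 transform in plain Python. This is our illustration, not the paper's implementation (which ran through MATLAB and Jacket's GPU-backed routines); batching many OFDM symbols through one transform call is what lets a GPU amortize its launch and transfer overhead, consistent with the observation that small symbol counts underutilize the device.

```python
import cmath

def _fft(x, sign):
    """Recursive radix-2 transform; sign=-1 is forward, sign=+1 is inverse (unscaled).
    Input length must be a power of two."""
    n = len(x)
    if n == 1:
        return x[:]
    even = _fft(x[0::2], sign)
    odd = _fft(x[1::2], sign)
    out = [0j] * n
    for k in range(n // 2):
        # Twiddle factor combines the half-size sub-transforms.
        tw = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out

def fft(x):
    return _fft(x, -1)

def ifft(x):
    n = len(x)
    return [v / n for v in _fft(x, +1)]  # 1/N scaling on the inverse
```

In an OFDM transmitter, `ifft` turns one vector of subcarrier symbols into a time-domain symbol; the butterfly structure exposes the independent lane-wise operations a GPU exploits across batched symbols.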
VII. CONCLUSION

In this paper we discussed the advantages of using a Software Defined Radio (SDR) and the benefits of implementing vectorization using parallel architectures to achieve flexibility. We compared different parallel architectures and their differences, and saw the significance of using a combination of VLIW and SIMD to run multiple standards from the extra room for computation. As the initial step of this research, we synthesized an OFDM communication model to run on a Tesla C1060 GPU against a Core i7 CPU, capturing results that show first signs of speedup from the IFFT/FFT block, as expected. Our future work includes further optimizing the code to run more modules of the OFDM baseband layer efficiently on the GPU and building a test bench that is not constrained by the MATLAB tool chain, to exclude MATLAB-GPU communication overhead.

Fig. 1. CPU vs GPU ifft/fft runtime against number of OFDM symbols with baseband set to 512 subcarriers and 4QAM modulation.

Fig. 2. CPU vs GPU encoding runtime against number of OFDM symbols with baseband set to 512 subcarriers and 4QAM modulation.

Fig. 3. CPU vs GPU ifft/fft runtime against number of OFDM symbols with baseband set to 512 subcarriers and 4QAM modulation.

Fig. 4. CPU vs GPU ifft/fft runtime against number of OFDM symbols with baseband set to 512 subcarriers and 4QAM modulation.

Fig. 5. CPU vs GPU ifft/fft runtime against number of OFDM symbols with baseband set to 512 subcarriers and 4QAM modulation.

VIII. ACKNOWLEDGMENT

REFERENCES

[1] C. van Berkel, F. Heinle, P. P. Meuwissen, K. Moerman, and M. Weiss, "Vector processing as an enabler for software-defined radio in handsets from 3G+WLAN onwards," in Proc. of SDR 2004, 2004, pp. 125-130.
[2] C. Kozyrakis and D. Patterson, "Vector vs. superscalar and VLIW architectures for embedded multimedia benchmarks," in Proc. of MICRO-35, 2002, pp. 283-293.
[3] M. Woh, Y. Harel, S. Mahlke, T. Mudge, C. Chakrabarti, and K. Flautner, "SODA: A low-power architecture for software radio," in Proc. of ISCA 2006, 2006, pp. 89-101. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=1635943
[4] J. Kim, S. Hyeon, and S. Choi, "Implementation of an SDR system using graphics processing unit," IEEE Communications Magazine, vol. 48, no. 3, pp. 156-162, 2010.
[5] W. Plishker, G. F. Zaki, S. S. Bhattacharyya, C. Clancy, and J. Kuykendall, "Applying graphics processor acceleration in a software defined radio prototyping environment," in Proc. of RSP 2011, May 2011, pp. 67-73. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=5929977
[6] L. Han, J. Chen, C. Zhou, Y. Li, X. Zhang, Z. Liu, X. Wei, and B. Li, "An embedded reconfigurable SIMD DSP with capability of dimension-controllable vector processing," in Proc. of ICCD 2004, 2004, pp. 446-451. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=1347960
[7] P. Westermann, G. Beier, H. Ait-harma, L. Schwoerer, N. Siemens, and N. Deutschland, "Performance analysis of W-CDMA algorithms on a vector DSP," in Proc. of ECCSC 2008, 2008, pp. 307-311.
[8] NVIDIA CUDA. [Online]. Available: http://docs.nvidia.com/cuda/index.html
[9] GNU Radio. [Online]. Available: gnuradio.org
